VIBE logo

VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR

The University of Texas at Austin
NeurIPS 2025

*Indicates Equal Contribution

TL;DR

  • Insight:
    • Many decision-making tasks need concise video summaries to save time and reduce cognitive load, but current vision-language models often generate verbose, less useful outputs.
    • Effective video understanding relies on concise, relevant captions, but this information is often diluted in lengthy or unfocused summaries.
  • VIBE evaluates and selects video-to-text summaries without requiring human annotations by leveraging two scores:
    • Grounding → Fidelity of the summary anchored to the video.
    • Utility → Relevance of the summary to any downstream task.
  • Our human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that VIBE-selected summaries can boost task accuracy by 61.2% while reducing human response time by 75.8%.

Why Annotation-Free Evaluation?

Evaluating video summaries is challenging:

  • Human annotations are costly and slow to collect.
  • Gold-standard summaries often fail to capture the diversity of valid outputs.
  • VLMs generate plausible text that may not be faithful to the video or useful for downstream tasks.

The VIBE Evaluation Framework

What Is the Key Idea Behind VIBE?

VIBE is inspired by the Information Bottleneck (IB) principle. In IB, a good representation should compress input data while preserving information relevant to the target output. Similarly, VIBE treats the textual summary as the intermediate representation between the input video and the downstream task, and evaluates summaries through two scores:

  • Grounding → Fidelity of the summary anchored to the video.
  • Utility → Relevance of the summary to the task.

We aim to select summaries that jointly maximize these two scores.

VIBE Problem Plot
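
Spelling out the correspondence (notation ours): classical IB seeks a representation $T$ of an input $X$ by solving

$$\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y),$$

compressing $X$ while retaining information about the target $Y$, with $\beta$ trading off the two terms. VIBE maps the input to the video $V$, the representation to the candidate summary $T$, and the target to the downstream task output $Y$, and judges each candidate by how large $\mathrm{Grounding}(T) \approx I(V;T)$ and $\mathrm{Utility}(T) \approx I(T;Y)$ are.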

How Does VIBE Work?

  1. Generate a few candidate summaries with a VLM.
  2. Compute Grounding and Utility Scores via multimodal masking and pointwise mutual information.
  3. Rank with rejection sampling to pick the best trade-off.

VIBE System Plot
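
A minimal Python sketch of this loop, assuming a VLM wrapper `vlm` with a hypothetical `summarize` method; `grounding_score` and `utility_score` are sketched in the next section, and `task` stands for the downstream task signal that the summary must support:

def vibe_select(video, task, vlm, n_candidates=8, mode="max-u"):
    # 1. Sample a pool of candidate summaries from the VLM.
    candidates = [vlm.summarize(video, temperature=1.0) for _ in range(n_candidates)]

    # 2. Score every candidate: grounding approximates I(V; T),
    #    utility approximates I(T; Y). (Scoring sketches follow below.)
    scored = [
        (text, grounding_score(vlm, video, text), utility_score(vlm, text, task))
        for text in candidates
    ]

    # 3. Keep the best candidate under the chosen criterion; "max-u" and "max-g"
    #    mirror the Max-Utility / Max-Grounding variants reported in the human study.
    key = (lambda s: s[2]) if mode == "max-u" else (lambda s: s[1])
    return max(scored, key=key)[0]

# Usage (hypothetical objects): summary = vibe_select(video, task, vlm)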

VIBE = Multimodal Masking + VLM Next Token Prediction

VIBE adapts the information bottleneck principle to video summarization:

  • Grounding: Approximates $I(V;T)$, testing how well the video supports reconstruction of a masked summary.
  • Utility: Approximates $I(T;Y)$, testing how well the summary compensates for missing video information to predict a task label.

By maximizing both, VIBE selects concise, task-relevant summaries—without gold labels or retraining.

VIBE Method Plot
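
One way to realize these scores with a VLM that exposes token log-probabilities is a pointwise-mutual-information-style contrast. This is a simplified sketch, not the paper's exact masking scheme or prompts; every method name (e.g., `vlm.token_logprobs`) is hypothetical, and how the task signal $Y$ is obtained without human-written summaries follows the paper and is simply passed in here.

def sequence_logprob(vlm, target_text, context):
    # Sum of next-token log-probabilities of target_text under the given
    # multimodal context (hypothetical VLM API).
    return sum(vlm.token_logprobs(target_text, context=context))

def grounding_score(vlm, video, summary):
    # Grounding ~ I(V; T): does seeing the video make the summary tokens more
    # predictable than text alone? The paper masks parts of the summary and
    # tests reconstruction; this sketch uses the simplest full-sequence contrast.
    return (sequence_logprob(vlm, summary, context={"video": video})
            - sequence_logprob(vlm, summary, context={}))

def utility_score(vlm, summary, task):
    # Utility ~ I(T; Y): with the video withheld, does the summary recover the
    # information needed to produce the task output Y?
    question, y = task["question"], task["answer"]
    return (sequence_logprob(vlm, y, context={"question": question, "summary": summary})
            - sequence_logprob(vlm, y, context={"question": question}))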

What Do Experiments Show?

Pareto Optimality Across Datasets

We evaluate VIBE on three diverse datasets: LearningPaper24 (a self-curated collection of OpenReview talks), SUTD-TrafficQA (traffic accident videos), and LongVideoBench (long Internet videos).

By plotting grounding against utility scores, we consistently observe a Pareto frontier that highlights the trade-off between the two. Summaries selected by VIBE lie on this frontier, outperforming both random selection and naïve VLM outputs. On the LearningPaper24 dataset, where human-written TL;DRs are available, the VIBE Pareto curve even dominates the human annotations.

Pareto Curve
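
For reference, a small helper (ours, not from the paper's code) that extracts such a frontier from (grounding, utility) pairs:

def pareto_frontier(points):
    # Keep every (grounding, utility) point that no other point dominates,
    # i.e. no other point is at least as good on both axes and strictly
    # better on at least one.
    frontier = []
    for i, (g, u) in enumerate(points):
        dominated = any(
            (g2 >= g and u2 >= u) and (g2 > g or u2 > u)
            for j, (g2, u2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((g, u))
    return frontier

# Example: pareto_frontier([(0.2, 0.9), (0.5, 0.5), (0.4, 0.4)])
# returns [(0.2, 0.9), (0.5, 0.5)]; the last point is dominated.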

Human Studies

In user studies with 243 participants, VIBE-selected summaries significantly improved QA accuracy and reduced response time compared with naïve VLM outputs and watching the raw video.

In the figure, points closer to the upper-right corner indicate a better balance of speed and accuracy. Both Max-Utility (Max-U) and Max-Grounding (Max-G) lie above naïve VLM summaries, achieving higher accuracy, and to the right of raw video consumption, achieving faster responses. Correlation analysis further reveals a strong positive relationship between the utility score and human QA accuracy.

Human Study Results
Performance Table

Conclusion

VIBE offers an annotation-free framework for evaluating video-to-text summaries by balancing grounding and utility scores. It selects summaries that enhance human decision-making across diverse datasets and tasks, outperforming both naïve VLM outputs and even human-written summaries in some cases.

BibTeX

@misc{chen2025vibevideototextinformationbottleneck,
      title={VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR}, 
      author={Shenghui Chen and Po-han Li and Sandeep Chinchali and Ufuk Topcu},
      year={2025},
      eprint={2505.17423},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17423}, 
}