VIBE logo

VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR

The University of Texas at Austin
NeurIPS 2025

*Indicates Equal Contribution

TL;DR

  • Insight:
    • Many decision-making tasks need concise video summaries to save time and reduce cognitive load, but current vision-language models often generate verbose, less useful outputs.
    • Effective video understanding relies on concise, relevant captions, but this information is often diluted in lengthy or unfocused summaries.
  • VIBE evaluates and selects video-to-text summaries without requiring human annotations by leveraging two scores:
    • Grounding → Fidelity of the summary anchored to the video.
    • Utility → Relevance of the summary to any downstream task.
  • Our human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that VIBE-selected summaries can boost task accuracy by 61.2% while reducing human response time by 75.8%.

Why Annotation-Free Evaluation?

Evaluating video summaries is challenging:

  • Human annotations are costly and slow to collect.
  • Gold-standard summaries often fail to capture the diversity of valid outputs.
  • VLMs generate plausible text that may not be faithful to the video or useful for downstream tasks.

The VIBE Evaluation Framework

What Is the Key Idea Behind VIBE?

VIBE is inspired by the Information Bottleneck (IB) principle. In IB, a good representation should compress input data while preserving information relevant to the target output. Similarly, VIBE treats the textual summary as the intermediate representation between the input video and the downstream task, and evaluates summaries through two scores:

  • Grounding → Fidelity of the summary anchored to the video.
  • Utility → Relevance of the summary to the task.

We aim to select summaries that jointly maximize these two scores.

VIBE Problem Plot
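
Spelling out the correspondence (notation ours): classical IB seeks a representation $T$ of an input $X$ by solving

$$\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y),$$

compressing $X$ while retaining information about the target $Y$, with $\beta$ trading off the two terms. VIBE maps the input to the video $V$, the representation to the candidate summary $T$, and the target to the downstream task output $Y$, and judges each candidate by how large $\mathrm{Grounding}(T) \approx I(V;T)$ and $\mathrm{Utility}(T) \approx I(T;Y)$ are.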

How Does VIBE Work?

  1. Generate a few candidate summaries with a VLM.
  2. Compute Grounding and Utility Scores via multimodal masking and pointwise mutual information.
  3. Rank with rejection sampling to pick the best trade-off.

VIBE System Plot
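
A minimal Python sketch of this loop, assuming a VLM wrapper `vlm` with a hypothetical `summarize` method; `grounding_score` and `utility_score` are sketched in the next section, and `task` stands for the downstream task signal that the summary must support:

def vibe_select(video, task, vlm, n_candidates=8, mode="max-u"):
    # 1. Sample a pool of candidate summaries from the VLM.
    candidates = [vlm.summarize(video, temperature=1.0) for _ in range(n_candidates)]

    # 2. Score every candidate: grounding approximates I(V; T),
    #    utility approximates I(T; Y). (Scoring sketches follow below.)
    scored = [
        (text, grounding_score(vlm, video, text), utility_score(vlm, text, task))
        for text in candidates
    ]

    # 3. Keep the best candidate under the chosen criterion; "max-u" and "max-g"
    #    mirror the Max-Utility / Max-Grounding variants reported in the human study.
    key = (lambda s: s[2]) if mode == "max-u" else (lambda s: s[1])
    return max(scored, key=key)[0]

# Usage (hypothetical objects): summary = vibe_select(video, task, vlm)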

VIBE = Multimodal Masking + VLM Next Token Prediction

VIBE adapts the information bottleneck principle to video summarization:

  • Grounding: Approximates $I(V;T)$, testing how well the video supports reconstruction of a masked summary.
  • Utility: Approximates $I(T;Y)$, testing how well the summary compensates for missing video information to predict a task label.

By maximizing both, VIBE selects concise, task-relevant summaries—without gold labels or retraining.

VIBE Method Plot
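
One way to realize these scores with a VLM that exposes token log-probabilities is a pointwise-mutual-information-style contrast. This is a simplified sketch, not the paper's exact masking scheme or prompts; every method name (e.g., `vlm.token_logprobs`) is hypothetical, and how the task signal $Y$ is obtained without human-written summaries follows the paper and is simply passed in here.

def sequence_logprob(vlm, target_text, context):
    # Sum of next-token log-probabilities of target_text under the given
    # multimodal context (hypothetical VLM API).
    return sum(vlm.token_logprobs(target_text, context=context))

def grounding_score(vlm, video, summary):
    # Grounding ~ I(V; T): does seeing the video make the summary tokens more
    # predictable than text alone? The paper masks parts of the summary and
    # tests reconstruction; this sketch uses the simplest full-sequence contrast.
    return (sequence_logprob(vlm, summary, context={"video": video})
            - sequence_logprob(vlm, summary, context={}))

def utility_score(vlm, summary, task):
    # Utility ~ I(T; Y): with the video withheld, does the summary recover the
    # information needed to produce the task output Y?
    question, y = task["question"], task["answer"]
    return (sequence_logprob(vlm, y, context={"question": question, "summary": summary})
            - sequence_logprob(vlm, y, context={"question": question}))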

What Do Experiments Show?

Pareto Optimality Across Datasets

We evaluate VIBE on three diverse datasets: LearningPaper24 (a self-curated collection of OpenReview talks), SUTD-TrafficQA (traffic accident videos), and LongVideoBench (long Internet videos).

By plotting grounding against utility scores, we consistently observe a Pareto frontier that highlights the trade-off between the two. Summaries selected by VIBE lie on this frontier, outperforming both random selection and naïve VLM outputs. On the LearningPaper24 dataset, where human-written TL;DRs are available, the VIBE Pareto curve even dominates the human annotations.

Pareto Curve
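
For reference, a small helper (ours, not from the paper's code) that extracts such a frontier from (grounding, utility) pairs:

def pareto_frontier(points):
    # Keep every (grounding, utility) point that no other point dominates,
    # i.e. no other point is at least as good on both axes and strictly
    # better on at least one.
    frontier = []
    for i, (g, u) in enumerate(points):
        dominated = any(
            (g2 >= g and u2 >= u) and (g2 > g or u2 > u)
            for j, (g2, u2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((g, u))
    return frontier

# Example: pareto_frontier([(0.2, 0.9), (0.5, 0.5), (0.4, 0.4)])
# returns [(0.2, 0.9), (0.5, 0.5)]; the last point is dominated.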

Human Studies

In user studies with 243 participants, VIBE-selected summaries significantly improved QA accuracy and reduced response time compared with naïve VLM outputs and watching the raw video.

In the figure, points closer to the upper-right corner indicate a better balance of speed and accuracy. Both Max-Utility (Max-U) and Max-Grounding (Max-G) lie above naïve VLM summaries, achieving higher accuracy, and to the right of raw video consumption, achieving faster responses. Correlation analysis further reveals a strong positive relationship between the utility score and human QA accuracy.

Human Study Results
Performance Table

Conclusion

VIBE offers an annotation-free framework for evaluating video-to-text summaries by balancing grounding and utility scores. It selects summaries that enhance human decision-making across diverse datasets and tasks, outperforming both naïve VLM outputs and even human-written summaries in some cases.

BibTeX

@misc{chen2025vibevideototextinformationbottleneck,
      title={VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR}, 
      author={Shenghui Chen and Po-han Li and Sandeep Chinchali and Ufuk Topcu},
      year={2025},
      eprint={2505.17423},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17423}, 
}