TL;DR
- Insight:
- Many decision-making tasks need concise video summaries to save time and reduce cognitive load, but current vision-language models often generate verbose, less useful outputs.
- Effective video understanding relies on concise, relevant captions, but this information is often diluted in lengthy or unfocused summaries.
- VIBE evaluates and selects video-to-text summaries without requiring human annotations by leveraging two scores:
- Grounding → Fidelity of the summary anchored to the video.
- Utility → Relevance of the summary to any downstream task.
- Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench validate that VIBE-selected summaries can boost task accuracy by 61.2% while reducing human response time by 75.8%.
Why Annotation-Free Evaluation?
Evaluating video summaries is challenging:
- Human annotations are costly and slow to collect.
- Gold-standard summaries often fail to capture the diversity of valid outputs.
- VLMs generate plausible text that may not be faithful to the video or useful for downstream tasks.
The VIBE Evaluation Framework
What Is the Key Idea Behind VIBE?
VIBE is inspired by the Information Bottleneck (IB) principle. In IB, a good representation should compress input data while preserving information relevant to the target output. Similarly, VIBE treats the textual summary as the intermediate representation between the input video and the downstream task, and evaluates summaries through two scores:
- Grounding → Fidelity of the summary anchored to the video.
- Utility → Relevance of the summary to the task.
We aim to select summaries that optimize for these two scores.
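For reference, the classical IB objective for a representation $T$ of input $X$ with target $Y$ is the standard Lagrangian below (textbook formulation, not taken from this paper); in VIBE's reading, $X$ is the video, $T$ the textual summary, and $Y$ the downstream task output, and instead of optimizing a stochastic encoder, VIBE scores fixed candidate summaries by the two mutual-information terms:

$$
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y), \qquad \beta > 0 .
$$

Note the sign difference: classical IB compresses by minimizing $I(X;T)$, whereas VIBE keeps its grounding analogue of $I(V;T)$ high so the summary stays faithful to the video.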

How Does VIBE Work?
- Generate a few candidate summaries with a VLM.
- Compute Grounding and Utility Scores via multimodal masking and pointwise mutual information.
- Rank with rejection sampling to pick the best trade-off (a minimal sketch follows this list).
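The selection step can be sketched as follows, assuming the candidate summaries have already been generated and that `grounding_score` and `utility_score` are placeholder callables wrapping the masking-based estimates described in the next section; the rejection step is collapsed into a simple Pareto filter for illustration, not the paper's exact procedure.

```python
from typing import Callable, List, Tuple

def select_summary(
    candidates: List[str],
    grounding_score: Callable[[str], float],  # placeholder: approximates I(V; T)
    utility_score: Callable[[str], float],    # placeholder: approximates I(T; Y)
) -> str:
    """Pick the candidate summary with the best grounding/utility trade-off."""
    # Score every candidate once with both estimators.
    scored: List[Tuple[str, float, float]] = [
        (s, grounding_score(s), utility_score(s)) for s in candidates
    ]
    # Reject candidates that are strictly dominated on both scores,
    # leaving an empirical Pareto frontier.
    frontier = [
        (s, g, u)
        for (s, g, u) in scored
        if not any(g2 > g and u2 > u for (_, g2, u2) in scored)
    ]
    # Break ties on the frontier by the sum of the two scores.
    best, _, _ = max(frontier, key=lambda item: item[1] + item[2])
    return best
```

In practice the two callables would query the same VLM used for generation, so the whole loop stays annotation-free.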

VIBE = Multimodal Masking + VLM Next Token Prediction
VIBE adapts the information bottleneck principle to video summarization:
- Grounding: Approximates $I(V;T)$, testing how well the video supports reconstruction of a masked summary.
- Utility: Approximates $I(T;Y)$, testing how well the summary compensates for missing video information to predict a task label.
By maximizing both, VIBE selects concise, task-relevant summaries—without gold labels or retraining.
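Written out, both scores can be read as pointwise-mutual-information-style log-ratios of VLM next-token likelihoods. The notation below is an illustrative sketch rather than the paper's exact estimator: $T_{\text{mask}}$ are the masked summary tokens, $T_{\text{ctx}}$ the unmasked ones, $V_{\text{mask}}$ the masked video input, and $Y$ the downstream task label.

$$
\mathrm{Grounding}(T) \approx \log \frac{p_{\mathrm{VLM}}(T_{\text{mask}} \mid V,\, T_{\text{ctx}})}{p_{\mathrm{VLM}}(T_{\text{mask}} \mid T_{\text{ctx}})},
\qquad
\mathrm{Utility}(T) \approx \log \frac{p_{\mathrm{VLM}}(Y \mid T,\, V_{\text{mask}})}{p_{\mathrm{VLM}}(Y \mid V_{\text{mask}})} .
$$

A summary scores high on grounding when seeing the video makes its masked tokens much easier to predict, and high on utility when it recovers task-relevant information that the masked video no longer provides.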

What Do Experiments Show?
Pareto Optimality Across Datasets
We evaluate VIBE on three diverse datasets: LearningPaper24 (a self-curated collection of OpenReview talks), SUTD-TrafficQA (traffic accident videos), and LongVideoBench (long Internet videos).
Plotting grounding against utility scores, we consistently observe a Pareto frontier that makes the trade-off between the two explicit. Summaries selected by VIBE lie on this frontier, outperforming both random selection and naïve VLM outputs. On the LearningPaper24 dataset, where human-written TL;DRs are available, the VIBE Pareto curve even dominates the human annotations.

Human Studies
In user studies with 243 participants, VIBE-selected summaries significantly improved QA accuracy and shortened response time compared with naïve VLM outputs and watching the raw video.
In the figure, points closer to the upper-right corner indicate a better balance of speed and accuracy. Both Max-Utility (Max-U) and Max-Grounding (Max-G) lie above naïve VLM summaries, achieving higher accuracy, and to the right of raw video consumption, achieving faster responses. Correlation analysis further shows a strong positive correlation between the utility score and human QA accuracy.


Conclusion
VIBE offers an annotation-free framework for evaluating video-to-text summaries by balancing grounding and utility scores. It selects summaries that enhance human decision-making across diverse datasets and tasks, outperforming naïve VLM outputs and, in some cases, even human-written summaries.
BibTeX
@misc{chen2025vibevideototextinformationbottleneck,
  title={VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR},
  author={Shenghui Chen and Po-han Li and Sandeep Chinchali and Ufuk Topcu},
  year={2025},
  eprint={2505.17423},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.17423},
}