Abstract
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01× FLOPs reduction and 3.10× prefilling speedup over the strongest baseline while still improving performance.
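The abstract describes the temporal stage as coarse-to-fine keyframe sampling driven by event density and question relevance. The snippet below is a minimal sketch of that idea under stated assumptions, not the paper's exact algorithm: per-frame event counts serve as the density signal, a CLIP-style frame-question similarity stands in for the relevance score, and names such as `density_quantile` and `num_keyframes` are illustrative.

```python
# Hedged sketch of coarse-to-fine keyframe sampling. The density threshold
# and relevance ranking are assumptions for illustration only.
import numpy as np

def coarse_to_fine_sampling(event_counts, frame_text_sims,
                            density_quantile=0.5, num_keyframes=16):
    """Select keyframe indices from a long video.

    event_counts:    (T,) events triggered within each frame interval.
    frame_text_sims: (T,) similarity between each frame and the question.
    """
    event_counts = np.asarray(event_counts, dtype=np.float64)
    frame_text_sims = np.asarray(frame_text_sims, dtype=np.float64)

    # Coarse stage: drop low-motion frames whose event density falls below
    # a quantile threshold (change-triggered property of event cameras).
    threshold = np.quantile(event_counts, density_quantile)
    candidates = np.nonzero(event_counts >= threshold)[0]

    # Fine stage: among surviving candidates, keep the frames most relevant
    # to the question, then restore temporal order.
    order = np.argsort(frame_text_sims[candidates])[::-1]
    keep = candidates[order[:num_keyframes]]
    return np.sort(keep)

# Example: 300 frames with synthetic densities and relevance scores.
rng = np.random.default_rng(0)
keyframes = coarse_to_fine_sampling(rng.poisson(50, 300), rng.random(300))
print(keyframes)
```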
EventBench Statistics
Top: A sunburst chart and a toy example of EventBench. Bottom: Statistics of EventBench, including the distributions of video lengths and capture platforms.
Task Examples
Visualization on EventBench. Random examples from the counting, ordering, action recognition, action prediction, action retrospection, object recognition, object attribute, and object location tasks are shown. Ground-truth answers are bolded, while yellow and green boxes highlight visual content relevant to the questions. (Zoom in for the best view.)
Event-Guided Video Understanding
Overview of EventSTU. It processes long videos through a sequential, multi-stage pipeline. First, coarse-to-fine sampling efficiently filters redundant frames based on event density and retrieves question-relevant keyframes. Subsequently, physics-aware pruning selects tokens with high event saliency, while semantic-aware pruning further distills them to the most semantically crucial tokens using attention scores. The pipeline yields a compact yet semantically dense visual representation tailored for LLM inference.
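The spatial stage combines a physics-aware filter (event saliency) with a semantic-aware filter (attention scores), under a token budget modulated by question relevance from the sampling stage. The sketch below illustrates one plausible realization; the per-patch event counts, the relevance-to-budget mapping, and the use of precomputed attention scores (rather than the model's internal attention) are all assumptions made for illustration.

```python
# Hedged sketch of event-guided token pruning with an adaptive budget.
import torch

def prune_tokens(tokens, patch_event_counts, attn_scores,
                 question_relevance, min_ratio=0.25, max_ratio=0.75):
    """Keep a subset of visual tokens for one keyframe.

    tokens:             (N, D) visual tokens from the vision encoder.
    patch_event_counts: (N,) events falling inside each patch (saliency prior).
    attn_scores:        (N,) attention received by each token (semantic cue).
    question_relevance: scalar in [0, 1] from keyframe sampling; more
                        relevant frames receive a larger token budget.
    """
    n = tokens.shape[0]
    # Adaptive budget: interpolate between the min and max keep ratios.
    keep_ratio = min_ratio + (max_ratio - min_ratio) * float(question_relevance)
    budget = max(1, int(round(keep_ratio * n)))

    # Physics-aware stage: pre-select tokens with high event saliency.
    k_phys = min(n, 2 * budget)
    phys_idx = torch.topk(patch_event_counts, k_phys).indices

    # Semantic-aware stage: distill to the final budget by attention score.
    sem_order = torch.topk(attn_scores[phys_idx], budget).indices
    keep_idx = phys_idx[sem_order].sort().values
    return tokens[keep_idx], keep_idx

# Example: 576 patch tokens, synthetic saliency/attention, relevance 0.8.
tok = torch.randn(576, 1024)
kept, idx = prune_tokens(tok, torch.rand(576), torch.rand(576), 0.8)
print(kept.shape, idx.shape)
```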
Qualitative Comparisons
Visual comparisons. "LLaVA-OV" denotes the original model without our method; it uses uniform sampling and misses keyframes. In contrast, our method captures all keyframes and prunes uninformative regions.