Abstract
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01× FLOPs reduction and 3.10× prefilling speedup over the strongest baseline while still improving performance.
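The abstract describes the temporal stage as coarse-to-fine keyframe sampling driven by event density and question relevance. The snippet below is a minimal sketch of that idea under stated assumptions, not the paper's exact algorithm: per-frame event counts serve as the density signal, a CLIP-style frame-question similarity stands in for the relevance score, and names such as `density_quantile` and `num_keyframes` are illustrative.

```python
# Hedged sketch of coarse-to-fine keyframe sampling. The density threshold
# and relevance ranking are assumptions for illustration only.
import numpy as np

def coarse_to_fine_sampling(event_counts, frame_text_sims,
                            density_quantile=0.5, num_keyframes=16):
    """Select keyframe indices from a long video.

    event_counts:    (T,) events triggered within each frame interval.
    frame_text_sims: (T,) similarity between each frame and the question.
    """
    event_counts = np.asarray(event_counts, dtype=np.float64)
    frame_text_sims = np.asarray(frame_text_sims, dtype=np.float64)

    # Coarse stage: drop low-motion frames whose event density falls below
    # a quantile threshold (change-triggered property of event cameras).
    threshold = np.quantile(event_counts, density_quantile)
    candidates = np.nonzero(event_counts >= threshold)[0]

    # Fine stage: among surviving candidates, keep the frames most relevant
    # to the question, then restore temporal order.
    order = np.argsort(frame_text_sims[candidates])[::-1]
    keep = candidates[order[:num_keyframes]]
    return np.sort(keep)

# Example: 300 frames with synthetic densities and relevance scores.
rng = np.random.default_rng(0)
keyframes = coarse_to_fine_sampling(rng.poisson(50, 300), rng.random(300))
print(keyframes)
```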
EventBench Statistics
Top: A sunburst chart and a toy example of EventBench. Bottom: Statistics of EventBench, including the distributions of video lengths and capture platforms.
Task Examples
Visualization on EventBench. Random examples from the counting, ordering, action recognition, action prediction, action retrospection, object recognition, object attribute, and object location tasks are shown. Ground-truth answers are bolded, while yellow and green boxes highlight visual content relevant to the questions. (Zoom in for the best view.)
Event-Guided Video Understanding
Overview of EventSTU. It processes long videos through a sequential, multi-stage pipeline. First, coarse-to-fine sampling efficiently filters redundant frames based on event density and retrieves question-relevant keyframes. Subsequently, physics-aware pruning selects tokens with high event saliency, while semantic-aware pruning further distills them to the most semantically crucial tokens using attention scores. The pipeline yields a compact yet semantically dense visual representation tailored for LLM inference.
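The spatial stage combines a physics-aware filter (event saliency) with a semantic-aware filter (attention scores), under a token budget modulated by question relevance from the sampling stage. The sketch below illustrates one plausible realization; the per-patch event counts, the relevance-to-budget mapping, and the use of precomputed attention scores (rather than the model's internal attention) are all assumptions made for illustration.

```python
# Hedged sketch of event-guided token pruning with an adaptive budget.
import torch

def prune_tokens(tokens, patch_event_counts, attn_scores,
                 question_relevance, min_ratio=0.25, max_ratio=0.75):
    """Keep a subset of visual tokens for one keyframe.

    tokens:             (N, D) visual tokens from the vision encoder.
    patch_event_counts: (N,) events falling inside each patch (saliency prior).
    attn_scores:        (N,) attention received by each token (semantic cue).
    question_relevance: scalar in [0, 1] from keyframe sampling; more
                        relevant frames receive a larger token budget.
    """
    n = tokens.shape[0]
    # Adaptive budget: interpolate between the min and max keep ratios.
    keep_ratio = min_ratio + (max_ratio - min_ratio) * float(question_relevance)
    budget = max(1, int(round(keep_ratio * n)))

    # Physics-aware stage: pre-select tokens with high event saliency.
    k_phys = min(n, 2 * budget)
    phys_idx = torch.topk(patch_event_counts, k_phys).indices

    # Semantic-aware stage: distill to the final budget by attention score.
    sem_order = torch.topk(attn_scores[phys_idx], budget).indices
    keep_idx = phys_idx[sem_order].sort().values
    return tokens[keep_idx], keep_idx

# Example: 576 patch tokens, synthetic saliency/attention, relevance 0.8.
tok = torch.randn(576, 1024)
kept, idx = prune_tokens(tok, torch.rand(576), torch.rand(576), 0.8)
print(kept.shape, idx.shape)
```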
Qualitative Comparisons
Visual comparisons. "LLaVA-OV" denotes the original model without our method; it uses uniform sampling and misses keyframes. In contrast, our method captures all keyframes and prunes uninformative regions.