omnimem compresses audio-visual llm memory for streaming video

source: arxiv artificial intelligence: omnimem: perturbation-aware memory compression for streaming audio-visual llms

level: research

audio-visual large language models can understand long videos, but their memory demands grow with video length. the number of tokens, and with it the key-value (kv) cache, grows linearly with the length of the stream, which makes inference expensive. omnimem is a streaming framework that compresses this memory. rather than treating all tokens the same, it uses a modality-aware strategy that handles visual and audio contexts separately, which addresses the large imbalance between the many visual tokens and the far fewer audio tokens.
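to make the linear growth and the modality imbalance concrete, here is a minimal sketch. the cache-size formula is the standard one for transformer decoders (one key and one value vector per token, per layer, per kv head). the `split_budget` function and its `audio_floor` parameter are hypothetical, chosen only for illustration; the summary does not say how omnimem actually divides its budget between modalities.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a KV cache in bytes; grows linearly with num_tokens.

    The leading factor 2 counts keys and values. bytes_per_value=2
    corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def split_budget(total_budget, visual_tokens, audio_tokens, audio_floor=0.25):
    """Split a fixed token budget between visual and audio memory.

    Hypothetical rule (an assumption, not OmniMem's method): start from a
    proportional split, but guarantee audio at least `audio_floor` of the
    budget so the much larger visual stream cannot crowd it out.
    Returns (visual_budget, audio_budget).
    """
    total = visual_tokens + audio_tokens
    if total == 0:
        return 0, 0
    proportional_audio = total_budget * audio_tokens // total
    audio_budget = max(proportional_audio, int(total_budget * audio_floor))
    # Never allocate more slots than there are audio tokens or budget.
    audio_budget = min(audio_budget, audio_tokens, total_budget)
    visual_budget = min(total_budget - audio_budget, visual_tokens)
    return visual_budget, audio_budget


if __name__ == "__main__":
    # Doubling the number of cached tokens doubles the cache size.
    print(kv_cache_bytes(1000, num_layers=32, num_kv_heads=8, head_dim=128))
    print(kv_cache_bytes(2000, num_layers=32, num_kv_heads=8, head_dim=128))
    # 9000 visual vs 1000 audio tokens, 1000 slots in total.
    print(split_budget(1000, visual_tokens=9000, audio_tokens=1000))
```

with a purely proportional split, audio would receive only 100 of the 1000 slots in the last call; the floor raises that to 250. a uniform policy over all tokens has no such safeguard, which is the imbalance the modality-aware strategy is meant to address.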

omnimem also uses perturbation-aware memory selection to keep only informative and non-redundant key-value states. this keeps the memory compact without losing the ability to understand long-range dependencies. the method identifies which parts of the memory are most useful and discards the rest. this is different from uniform compression that might remove important details. the framework is designed for streaming, so it can process video in real time while keeping memory use low.
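one way to read "perturbation-aware" selection is as a leave-one-out test: remove a cached entry, see how much the attention output changes, and keep the entries whose removal changes it most. the sketch below implements that reading in plain python. it is an assumption for illustration, not the paper's scoring rule, and the names `perturbation_scores` and `select_memory` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def perturbation_scores(query, keys, values):
    """Score each cached entry by how far the attention output moves
    when that entry alone is removed (requires at least two entries)."""
    full = attend(query, keys, values)
    scores = []
    for i in range(len(keys)):
        reduced_keys = keys[:i] + keys[i + 1:]
        reduced_values = values[:i] + values[i + 1:]
        perturbed = attend(query, reduced_keys, reduced_values)
        scores.append(math.dist(full, perturbed))
    return scores


def select_memory(query, keys, values, budget):
    """Return the indices of the `budget` highest-scoring entries,
    in their original (temporal) order."""
    scores = perturbation_scores(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[4.0, 0.0], [0.0, 4.0], [-20.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    # The third entry gets almost no attention weight for this query,
    # so removing it barely changes the output and it is dropped.
    print(select_memory(query, keys, values, budget=2))
```

the leave-one-out loop costs one attention pass per cached entry, so it is only practical at toy scale; a real streaming system would need a cheaper estimate of the same quantity.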

the researchers also explore budget-aware fine-tuning to make the model better at compression under real-world limits. this training step encourages the model to pack useful information into the retained memory. the approach is tested on long-form video tasks and shows it can maintain performance while using less memory. the work is relevant for deploying audio-visual models on devices with limited resources, like phones or edge hardware.

why it matters: it could enable efficient long-video understanding on resource-limited devices by limiting memory growth while maintaining task performance.
