source: arxiv artificial intelligence: cave-vlm-cot: an interpretable vision-language model framework

level: research

vision-language models often generate fluent but visually unfaithful descriptions, a problem known as hallucination. existing methods like chain-of-thought or retrieval augmentation do not fully solve this because they lack step-level citation grounding and cannot route verification failures back to retrieval for correction. the cave-vlm-cot framework introduces a five-stage closed-loop pipeline: extractor, retriever, solver, citation injector, and verifier. when the verifier detects ungrounded claims, it sends structured feedback to the extractor for targeted re-retrieval, ensuring each reasoning step is tied to evidence.

the framework is a modular, agentic retrieval-augmented generation system. first, the extractor identifies key entities and queries from the image and text. the retriever then fetches relevant external knowledge. the solver generates an answer with chain-of-thought reasoning. the citation injector attaches source citations to each reasoning step. finally, the verifier checks every claim against the retrieved evidence and the image. if a claim lacks support, the verifier sends structured feedback to the extractor, which triggers targeted re-retrieval, so the pipeline corrects itself rather than stopping at the first answer.
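the closed loop described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the authors' implementation: every function name, the toy knowledge base, the word-overlap matching, and the `max_rounds` cap are hypothetical, and the image input is not modelled at all.

```python
# Hypothetical sketch of a five-stage closed loop; all names are illustrative.

# Toy stand-in for external knowledge (assumption, not from the paper).
KB = {
    "d1": "zebras have black and white stripes",
    "d2": "zebras live in african savannas",
}


def overlap(a, b):
    """Number of distinct words shared by two strings."""
    return len(set(a.split()) & set(b.split()))


def extract(question, feedback):
    # Initial query plus any claims the verifier flagged as unsupported.
    return [question] + list(feedback)


def retrieve(queries):
    # Keep passages that share at least two words with some query.
    return {doc_id: text for doc_id, text in KB.items()
            if any(overlap(text, q) >= 2 for q in queries)}


def solve(question, evidence):
    # A real solver would be a vision-language model; this stub returns
    # fixed reasoning steps, one of which is initially ungrounded.
    return ["zebras have black and white stripes",
            "zebras live in african savannas"]


def inject(steps, evidence):
    # Attach to each step the ids of passages that closely match it.
    return [(step, [doc_id for doc_id, text in evidence.items()
                    if overlap(step, text) >= 3])
            for step in steps]


def verify(cited, evidence):
    # Return the claims that ended up with no supporting citation.
    return [claim for claim, citations in cited if not citations]


def run_pipeline(question, max_rounds=3):
    feedback = []
    cited = []
    for round_no in range(1, max_rounds + 1):
        queries = extract(question, feedback)
        evidence = retrieve(queries)
        steps = solve(question, evidence)
        cited = inject(steps, evidence)
        feedback = verify(cited, evidence)
        if not feedback:          # every step is grounded, stop looping
            return cited, round_no
    return cited, max_rounds      # give up after the round budget


if __name__ == "__main__":
    result, rounds = run_pipeline("what pattern do zebras have")
    print(rounds, result)
```

in this toy run the first round retrieves only one passage, so the second reasoning step has no citation; the verifier's feedback becomes an extra query, and the second round grounds both steps. the round cap is a design choice of the sketch to guarantee termination.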

to evaluate this approach, the authors propose 23 component-wise metrics covering retrieval quality, step-wise citation precision and recall, and cross-modal grounding. the main metric, cavescore, combines accuracy, citation faithfulness, and attribution into a single score. the authors report that cave-vlm-cot significantly reduces hallucinations compared to baselines while producing interpretable, verifiable outputs. because the design is modular, individual components can be swapped, which makes the framework adaptable to different vision-language tasks.
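as an illustration of what step-wise citation precision and recall could look like, the sketch below uses a common micro-averaged definition: precision is the share of attached citations that are correct, recall is the share of gold supporting passages that were cited. this definition, the function name, and the example data are assumptions; the summary does not give the paper's exact formulas, and no attempt is made here to reproduce how cavescore combines its components.

```python
def citation_precision_recall(predicted, gold):
    """Micro-averaged citation precision and recall over reasoning steps.

    predicted, gold: lists with one set of evidence ids per reasoning step.
    """
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical example: three reasoning steps.
predicted = [{"d1"}, {"d2", "d3"}, set()]      # citations the system attached
gold = [{"d1"}, {"d2"}, {"d4", "d5"}]          # passages that truly support each step

precision, recall = citation_precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

here two of the three attached citations are correct and two of the four gold passages are cited, so precision is 2/3 and recall is 1/2.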

why it matters: reducing hallucinations in vision-language models is critical for trustworthy ai in applications like medical imaging, autonomous driving, and content moderation, where factual errors can have serious consequences.

