MMDocRAG Overview

MMDocRAG is built for (i) multimodal document retrieval and (ii) retrieval-augmented multimodal generation:

MMDocIR (📖Paper 🏠Homepage 👉Github): Benchmarking Multi-Modal Retrieval for Long Documents:
- encompass two distinct tasks: page-level and layout-level retrieval.
- MMDocIR_Evaluation_Dataset: 1,685 expert-annotated questions for evaluation.
- MMDocIR_Train_Dataset: 173,843 bootstrapped questions for training.
- Retriever Checkpoints: 6 text and 4 vision retrievers.
MMDocRAG (📖Paper 🏠Homepage 👉Github): Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
- design for multimodal quotes selection and multimodal integration
- MMDocRAG: 4,055 expert-annotated questions and multimodal answer.
- MMDocRAG Training: 4,110 training samples derived from dev set.
- Retriever Checkpoints: 6 text and 4 vision retrievers.
- Trained models are: MMDocRAG_Qwen2.5-3B-Instruct_lora, MMDocRAG_Qwen2.5-7B-Instruct_lora, MMDocRAG_Qwen2.5-14B-Instruct_lora, MMDocRAG_Qwen2.5-32B-Instruct_lora, and MMDocRAG_Qwen2.5-72B-Instruct_lora.

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Kuicai Dong* · Yujing Chang* · Derrick Xin Deik Goh* · Dexun Li · Ruiming Tang · Yong Liu

📖Paper | 🏠Homepage | 🤗Huggingface | 👉Github

Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the later assesses the ability of detecting specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) MMDocIR training set effectively enhances the performance of multimodal document retrieval and (iii) text retrievers leveraging VLM-text significantly outperforms retrievers relying on OCR-text.

MMDocRAG: Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering

Kuicai Dong* · Yujing Chang* · Shijie Huang · Yasheng Wang · Ruiming Tang · Yong Liu

📖Paper | 🏠Homepage | 🤗Huggingface | 👉Github

Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence integration and selection. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that combine text with relevant visual elements. Through large-scale experiments with 60 language/vision models and 14 retrieval systems, we identify persistent challenges in multimodal evidence handling. Key findings reveal proprietary vision-language models show moderate advantages over text-only models, while open-source alternatives trail significantly. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems.

Citation

If you use any datasets from this organization in your research, please cite the original dataset as follows:

@misc{dong2025mmdocirbenchmarkingmultimodalretrieval,
      title={MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents}, 
      author={Kuicai Dong and Yujing Chang and Xin Deik Goh and Dexun Li and Ruiming Tang and Yong Liu},
      year={2025},
      eprint={2501.08828},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2501.08828}, 
}

@misc{dong2025benchmarkingretrievalaugmentedmultimomalgeneration,
      title={Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering}, 
      author={Kuicai Dong and Yujing Chang and Shijie Huang and Yasheng Wang and Ruiming Tang and Yong Liu},
      year={2025},
      eprint={2505.16470},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.16470}, 
}