| MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Link | Code | arXiv | AIMing Lab |
| Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling | Link | Code | arXiv | Qwen |
| ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents | Link | Code | arXiv | Alibaba NLP |
| DeepSeek-OCR: Contexts Optical Compression | Link | Code | arXiv | DeepSeek |
| DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding | Link | Code | arXiv | Alibaba |
| SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding | Link | Project | arXiv | Georgia Tech & JPMorgan Research |
| DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding | Link | Project | arXiv | Google Cloud |