When you run RAG over multimodal content, the core choice is whether to embed the images directly with a multimodal model, or to extract a text representation and embed that instead.
We ran a controlled comparison between a text-based retrieval pipeline and a native multimodal one across different document types to see where each approach wins. This post walks through what we found, where the differences actually show up, and where they don’t.
Setup
We evaluated two retrieval approaches on a focused sample from three public datasets: DocVQA (~10.5k examples), ChartQA (~32.7k examples), and AI2D (~3.1k examples).
The first approach is text-based retrieval: content is described as text, chunked, embedded with an OpenAI text embedding model, and retrieved via vector similarity search. For a diagram like a marine food web, that conversion step produces a chunk that spells out the entities and relationships shown in the image.
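To make the text-based side concrete, here is a minimal sketch of that flow. It is illustrative rather than our exact code: `describe_image` is a hypothetical stand-in for the image-to-text step, `text-embedding-3-small` is just one possible choice of OpenAI embedding model, and the chunking and similarity search are deliberately simplified.

```python
# Minimal sketch of the text-based pipeline (illustrative, not our exact code).
from openai import OpenAI
import numpy as np

client = OpenAI()

def describe_image(image_path: str) -> str:
    # Hypothetical image-to-text step (OCR, captioning, or a VLM prompt).
    raise NotImplementedError

def chunk(text: str, size: int = 800) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    # Cosine similarity between the query vector and every chunk vector.
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```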


The second approach is multimodal retrieval: content is kept as images and embedded directly with Voyage Multimodal 3.5, skipping the text extraction step entirely.
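The multimodal side can be sketched the same way. The snippet below assumes the `voyageai` Python client and its `multimodal_embed` method, and uses `voyage-multimodal-3` as a placeholder model identifier; check the current Voyage docs for the exact string that corresponds to the model above.

```python
# Minimal sketch of the multimodal pipeline (illustrative): pages stay as images
# and are embedded directly, with no text-extraction step in between.
import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def embed_pages(image_paths: list[str]) -> np.ndarray:
    # Each input is a list that may mix strings and PIL images; here, one image per document.
    inputs = [[Image.open(p)] for p in image_paths]
    result = vo.multimodal_embed(inputs=inputs, model="voyage-multimodal-3", input_type="document")
    return np.array(result.embeddings)

def retrieve(query: str, image_paths: list[str], page_vecs: np.ndarray, k: int = 5) -> list[str]:
    result = vo.multimodal_embed(inputs=[[query]], model="voyage-multimodal-3", input_type="query")
    q = np.array(result.embeddings[0])
    sims = page_vecs @ q / (np.linalg.norm(page_vecs, axis=1) * np.linalg.norm(q))
    return [image_paths[i] for i in np.argsort(-sims)[:k]]
```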
For reporting, we group the evaluation into three content sets:
- Set A: pure text documents (from DocVQA)
- Set B: tables (from DocVQA)
- Set C: charts (from ChartQA, with additional visuals from AI2D)
Each set contains 50 queries with gold answers. We report Recall@1, Recall@5, and MRR (Mean Reciprocal Rank), which measures how high the correct result typically ranks (higher is better).
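Given the rank of the gold item for each query, these metrics reduce to a few lines. A quick sketch (not our evaluation harness, toy numbers only):

```python
# ranks[i] is the 1-based position of the gold document for query i,
# or None if it was not retrieved at all.

def recall_at_k(ranks: list[int | None], k: int) -> float:
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr(ranks: list[int | None]) -> float:
    # Reciprocal rank: 1/rank of the gold document, 0 if it never appeared.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Toy example; in our evaluation each set has 50 queries.
ranks = [1, 2, 1, None, 5]
print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), mrr(ranks))  # 0.4 0.8 0.54
```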



Examples from the three sets, from left to right: text, tables, and diagrams.
Key Findings
The overall pattern is: multimodal embeddings do better on image-based content, and the gap is largest where layout and visual structure are important.
Recall@5 is near-perfect across all sets, meaning both approaches almost always surface the correct item somewhere in the top five. The difference is whether it lands at the very top (Recall@1) and how consistently it ranks early (MRR).
- Text: text embeddings perform better at Recall@1 (96% vs 92%)
- Tables: multimodal embeddings perform substantially better at Recall@1 (88% vs 76%)
- Charts: performance is close, but multimodal is still ahead (92% vs 90%)
Breakdown
Text documents
On pure text documents, the text-based pipeline performs slightly better than the multimodal one. Both pipelines reach 100% Recall@5, but text embeddings more often place the correct context at rank one.
This is the one category where multimodal embeddings don’t add value: there is no visual structure to preserve, and the strongest signal is already in the text. For text-only corpora, text embeddings remain the natural choice.

Tables
Tables produce the largest gap in the evaluation.
Linearizing tables into text removes structural information that many queries depend on: row–column relationships, alignment, grouping, and relative position. Even with careful extraction, that layout is hard to represent faithfully as plain text. Multimodal embeddings keep the table as a visual object and preserve those relationships implicitly, and retrieval benefits accordingly.
This shows up directly in the metrics: a 12-point Recall@1 difference in favor of multimodal embeddings.
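A toy illustration of the problem (hypothetical numbers, and not the exact linearization we used): flattening a small table row by row produces text where the row-column alignment has to be re-inferred, while in the image it is directly visible.

```python
# Naive row-by-row linearization of a small table (toy data).
rows = [
    ["Region", "Q1", "Q2"],
    ["North",  "120", "135"],
    ["South",  "90",  "110"],
]

flat = " ".join(cell for row in rows for cell in row)
print(flat)
# -> "Region Q1 Q2 North 120 135 South 90 110"
# A query like "What was South's Q2 value?" now depends on the embedding model
# re-inferring that 110 sits in the Q2 column of the South row; in the rendered
# table, that alignment is visually explicit.
```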

If a corpus contains a meaningful amount of tabular data, this is where multimodal embeddings provide a clear and measurable advantage.
Charts
Charts were the most interesting case because the gap was smaller than expected: multimodal is still higher (92% vs 90%), but only slightly.

The gap is smaller here because, as in the marine food web example, the extraction step produces structured descriptions with explicit relationships, so many chart questions can already be answered from text alone. Multimodal still has an edge because it also sees layout and grouping that never appear in the extracted description.
Across these document types, the representation you choose controls what signal is available to match against. For pages that are fundamentally visual, the pattern is consistent: multimodal embeddings handle them natively, while a text pipeline can only approximate that by reconstructing the structure in language.
Conclusion
From this benchmark, a few patterns became clear:
- For pure text, text embeddings are enough and slightly better.
- For image-based content, multimodal embeddings are the better default, with the largest gap on tables and a smaller but consistent edge on charts.
The difference mostly comes down to what gets lost when an image is converted into text. Tables (and many chart questions) rely on layout and visual grouping. Multimodal embeddings retain that signal directly, which keeps them ahead in our results.
