Leaderboard

Best LLMs for RAG

The definitive ranking of language models for Retrieval-Augmented Generation (RAG). Compare answer quality and citation accuracy across leading models.

Last updated: January 15, 2025

| Rank | Model Name        | Score | Provider   | License     | Context |
|------|-------------------|-------|------------|-------------|---------|
| 1    | GPT-4o            | 96.3  | OpenAI     | Proprietary | 128k    |
| 2    | Claude 3.5 Sonnet | 95.8  | Anthropic  | Proprietary | 200k    |
| 3    | Gemini 1.5 Pro    | 94.2  | Google     | Proprietary | 1M      |
| 4    | Claude 3 Opus     | 93.5  | Anthropic  | Proprietary | 200k    |
| 5    | Llama 3.1 405B    | 91.7  | Meta       | Open Source | 128k    |
| 6    | Command R+        | 90.1  | Cohere     | Proprietary | 128k    |
| 7    | Qwen 2.5 72B      | 88.9  | Alibaba    | Open Source | 128k    |
| 8    | Llama 3.1 70B     | 87.4  | Meta       | Open Source | 128k    |
| 9    | Mixtral 8x22B     | 85.8  | Mistral AI | Open Source | 64k     |
| 10   | DeepSeek v2.5     | 84.2  | DeepSeek   | Open Source | 128k    |

Key Insights

What the data tells us

Claude excels at citations

Claude 3.5 Sonnet scores 95.8, just behind GPT-4o, and stands out for citation quality, demonstrating an exceptional ability to stay grounded in source material while providing comprehensive answers.

Context length varies

While some models support 200K+ tokens, most RAG applications perform best with focused context. Longer context windows enable more flexible chunking strategies but aren't always necessary.

Instruction following matters

Top performers excel at following RAG-specific instructions like "only use the provided context" and "cite your sources." This reduces hallucinations and keeps answers grounded in the retrieved material.
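
A minimal sketch of the kind of grounding prompt these instructions imply is below; the wording, placeholder names, and the build_prompt helper are illustrative assumptions, not the benchmark's actual prompt.

```python
# Illustrative grounding prompt for RAG. The wording and field names are
# assumptions for this sketch, not the prompt used in the benchmark.
RAG_PROMPT = """Answer the question using ONLY the provided context.
If the context does not contain the answer, say you don't know.
Cite the source of every claim with its [doc_id].

Context:
{context}

Question:
{question}

Answer:"""

def build_prompt(question: str, retrieved_docs: list[dict]) -> str:
    """Format retrieved chunks into a single grounded prompt."""
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in retrieved_docs)
    return RAG_PROMPT.format(context=context, question=question)
```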

Methodology

How we rank LLMs for RAG

Quality scores combine answer faithfulness (staying true to retrieved sources) and relevance (addressing the query effectively). We evaluate models on their ability to generate accurate, well-cited responses using retrieved context. Scores are normalized to a 0-100 scale, where 100 represents perfect performance.
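
As a rough sketch of that combination, assuming equal weights (the leaderboard does not publish its exact weighting):

```python
def combined_quality(faithfulness: float, relevance: float,
                     w_faith: float = 0.5, w_rel: float = 0.5) -> float:
    """Blend per-answer faithfulness and relevance (each in [0, 1]) into a
    single quality value. The equal default weights are an assumption."""
    return w_faith * faithfulness + w_rel * relevance

# Example: a fully grounded answer that only partly addresses the query.
print(combined_quality(faithfulness=1.0, relevance=0.85))  # 0.925
```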

Testing Process

Each LLM is tested on question-answering tasks where they must use retrieved documents to generate accurate answers. We evaluate citation accuracy, answer completeness, and adherence to source material.
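
A sketch of what that testing loop can look like is below; the grader callables and the test-case fields (question, documents, and so on) are assumptions standing in for the actual harness.

```python
from typing import Callable

def evaluate_model(generate: Callable[[str, list[str]], str],
                   graders: dict[str, Callable[[str, dict], float]],
                   test_cases: list[dict]) -> list[dict]:
    """Run one model over QA tasks that each ship with retrieved documents,
    scoring every answer on citation accuracy, completeness, and adherence
    to the sources via the supplied grader callables."""
    results = []
    for case in test_cases:
        answer = generate(case["question"], case["documents"])
        results.append({name: grade(answer, case) for name, grade in graders.items()})
    return results
```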

Evaluation Metrics

Primary metrics include faithfulness (percentage of answer verifiable from sources), relevance (how well the answer addresses the query), and citation quality (accuracy of source attribution).
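
For the faithfulness metric specifically, a minimal sketch is below; how claims are extracted from an answer and checked against the sources (for example, by a judge model) is not specified here, so the per-claim verdicts are treated as given.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that are verifiable from the
    retrieved sources. The per-claim verdicts are assumed to come from an
    upstream checker (e.g. a judge model), which this sketch does not model."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Example: 4 of an answer's 5 claims are backed by the sources.
print(faithfulness([True, True, True, True, False]))  # 0.8
```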

Score Calculation

Scores are reported on a normalized 0-100 scale: 100 represents peak RAG performance, and lower scores indicate proportional decreases in answer quality or faithfulness.
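
Continuing the sketch above, the scaling step might look like the following, where max_quality (the highest value the rubric can award) is an assumed parameter:

```python
def to_leaderboard_scale(raw_quality: float, max_quality: float = 1.0) -> float:
    """Map a raw quality value onto the 0-100 leaderboard scale, where 100
    corresponds to peak RAG performance. max_quality is the highest value
    the grading rubric can award (an assumption in this sketch)."""
    return round(100 * raw_quality / max_quality, 1)

# Made-up example: a raw quality of 0.963 out of a possible 1.0 maps to 96.3.
print(to_leaderboard_scale(0.963))  # 96.3
```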

FAQ

Common questions

What makes an LLM good for RAG?
The best RAG LLMs excel at following instructions to cite sources, stay grounded in provided context, and avoid hallucinations. They also need strong comprehension to synthesize information from multiple retrieved documents.

Does a longer context window always help RAG?
Not necessarily. While longer context windows (100K+ tokens) provide flexibility, most RAG systems perform best with focused, relevant context. What matters most is retrieval quality and how well the LLM uses the provided information.

How do these scores compare to general LLM benchmarks?
RAG performance differs from general capabilities. A model might excel at creative writing but perform poorly at RAG due to weak citation habits. Our benchmarks specifically measure RAG-relevant skills like faithfulness and source attribution.

Should I use an open-source or proprietary LLM for RAG?
Proprietary models like Claude and GPT-4 currently lead in RAG quality, especially for citation accuracy. Open-source options like Llama 3 offer good performance with full control and lower costs, making them viable for many production use cases.