| Rank | Model Name | Score | Provider | License | Context (tokens) | Link |
|---|---|---|---|---|---|---|
| 🥇1 | GPT-4o | 96.3 | OpenAI | Proprietary | 128k | — |
| 🥈2 | Claude 3.5 Sonnet | 95.8 | Anthropic | Proprietary | 200k | — |
| 🥉3 | Gemini 1.5 Pro | 94.2 | Google | Proprietary | 1M | — |
| 4 | Claude 3 Opus | 93.5 | Anthropic | Proprietary | 200k | — |
| 5 | Llama 3.1 405B | 91.7 | Meta | Open Source | 128k | — |
| 6 | Command R+ | 90.1 | Cohere | Proprietary | 128k | — |
| 7 | Qwen 2.5 72B | 88.9 | Alibaba | Open Source | 128k | — |
| 8 | Llama 3.1 70B | 87.4 | Meta | Open Source | 128k | — |
| 9 | Mixtral 8x22B | 85.8 | Mistral AI | Open Source | 64k | — |
| 10 | DeepSeek v2.5 | 84.2 | DeepSeek | Open Source | 128k | — |
## Key Insights

*What the data tells us*

### Claude excels at citations
Claude 3.5 Sonnet posts one of the highest RAG quality scores on the leaderboard (95.8), demonstrating an exceptional ability to stay grounded in source material while still providing comprehensive answers.
### Context length varies
While some models support 200K+ tokens, most RAG applications perform best with focused context. Longer context windows enable more flexible chunking strategies but aren't always necessary.
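To make the chunking point concrete, here is a minimal Python sketch of a fixed-size chunking strategy with overlap; the `chunk_size` and `overlap` values are illustrative defaults, not settings used in this benchmark.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping fixed-size chunks so each retrieved
    passage fits comfortably inside the model's context window."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # the overlap preserves continuity across chunk boundaries
    return chunks
```

A larger context window mostly relaxes these parameters: you can afford bigger chunks and more of them, but retrieval quality still determines what ends up in the prompt.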
### Instruction following matters
Top performers excel at following RAG-specific instructions like "only use the provided context" and "cite your sources." This prevents hallucinations and ensures answer quality.
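As an illustration of what those instructions look like in practice, here is one way a grounded RAG prompt can be assembled in Python; the exact wording below is an assumption for illustration, not the prompt used in our evaluation.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded-answer prompt: the model is told to rely only on
    the numbered passages and to cite them by index."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the context below. "
        "Cite the passages you rely on as [1], [2], etc. "
        "If the context does not contain the answer, say that it does not.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Models that follow these constraints reliably are the ones that score well on faithfulness and citation quality.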
## Methodology

*How we rank LLMs for RAG*
Quality scores combine answer faithfulness (staying true to retrieved sources) and relevance (addressing the query effectively). We evaluate models on their ability to generate accurate, well-cited responses using retrieved context. Scores are normalized to a 0-100 scale where 100 represents the best performance.
### Testing Process
Each LLM is tested on question-answering tasks in which it must use retrieved documents to generate accurate answers. We evaluate citation accuracy, answer completeness, and adherence to source material.
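A schematic of how such a test harness can be structured is sketched below; `generate_answer` and `judge` are hypothetical callables standing in for the model API and the metric scorers, and `build_rag_prompt` refers to the prompt sketch above.

```python
def evaluate_model(generate_answer, judge, dataset: list[dict]) -> dict[str, float]:
    """Run the QA-over-retrieved-documents protocol for one model and average
    the per-example metrics (each on a 0-1 scale)."""
    totals = {"faithfulness": 0.0, "relevance": 0.0, "citation_quality": 0.0}
    for example in dataset:
        prompt = build_rag_prompt(example["question"], example["passages"])
        answer = generate_answer(prompt)   # call into the model under test
        scores = judge(answer, example)    # returns the three metrics for this example
        for metric in totals:
            totals[metric] += scores[metric]
    return {metric: total / len(dataset) for metric, total in totals.items()}
```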
### Evaluation Metrics
Primary metrics include faithfulness (percentage of answer verifiable from sources), relevance (how well the answer addresses the query), and citation quality (accuracy of source attribution).
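To show how the faithfulness metric can be operationalized, here is a simplified sketch that treats faithfulness as the fraction of answer claims a support checker can verify; the substring checker below is a toy stand-in for the stronger judges used in real evaluations.

```python
def faithfulness_score(answer_claims: list[str], is_supported) -> float:
    """Faithfulness as the fraction of claims in the answer that a support
    checker can verify against the retrieved sources (1.0 = fully grounded)."""
    if not answer_claims:
        return 0.0
    supported = sum(1 for claim in answer_claims if is_supported(claim))
    return supported / len(answer_claims)

# Toy usage with a hypothetical checker that looks for exact substring support:
sources = "The 2024 report lists revenue of $12M and headcount of 85."
checker = lambda claim: claim in sources
print(faithfulness_score(["revenue of $12M", "headcount of 85", "profit of $3M"], checker))  # ~0.67
```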
### Score Calculation
Scores are normalized relative to the best-performing model in our tests. A score of 100 represents peak RAG performance, while lower scores indicate proportional decreases in answer quality or faithfulness.
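A minimal sketch of that rescaling step, assuming raw metric scores fall on a 0-1 scale; the `reference` anchor (a theoretical maximum versus the best observed raw score) is left as a parameter because it is an implementation detail not specified here.

```python
def to_leaderboard_scale(raw_quality: float, reference: float = 1.0) -> float:
    """Map a raw 0-1 quality score onto the 0-100 leaderboard scale, where
    `reference` is the raw score treated as peak RAG performance."""
    return round(100.0 * raw_quality / reference, 1)

# Illustrative only (not actual benchmark numbers): a raw quality of 0.963
# against a reference of 1.0 lands at 96.3 on the leaderboard scale.
print(to_leaderboard_scale(0.963))  # 96.3
```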
## FAQ

*Common questions*
- **What makes an LLM good for RAG?**
  The best RAG LLMs excel at following instructions to cite sources, stay grounded in the provided context, and avoid hallucinations. They also need strong comprehension to synthesize information from multiple retrieved documents.
- **Does a longer context window always help RAG?**
  Not necessarily. While longer context windows (100K+ tokens) provide flexibility, most RAG systems perform best with focused, relevant context. What matters most is retrieval quality and how well the LLM uses the provided information.
- **How do these scores compare to general LLM benchmarks?**
  RAG performance differs from general capabilities. A model might excel at creative writing but perform poorly at RAG due to weak citation habits. Our benchmarks specifically measure RAG-relevant skills like faithfulness and source attribution.
- **Should I use an open-source or proprietary LLM for RAG?**
  Proprietary models like Claude 3.5 Sonnet and GPT-4o currently lead in RAG quality, especially for citation accuracy. Open-source options like Llama 3.1 offer strong performance with full control and lower costs, making them viable for many production use cases.