| Rank | Model Name | Score | Provider | License | Context (tokens) | Link |
|---|---|---|---|---|---|---|
| 🥇1 | GPT-4o | 96.3 | OpenAI | Proprietary | 128k | — |
| 🥈2 | Claude 3.5 Sonnet | 95.8 | Anthropic | Proprietary | 200k | — |
| 🥉3 | Gemini 1.5 Pro | 94.2 | Google | Proprietary | 1M | — |
| 4 | Claude 3 Opus | 93.5 | Anthropic | Proprietary | 200k | — |
| 5 | Llama 3.1 405B | 91.7 | Meta | Open Source | 128k | — |
| 6 | Command R+ | 90.1 | Cohere | Proprietary | 128k | — |
| 7 | Qwen 2.5 72B | 88.9 | Alibaba | Open Source | 128k | — |
| 8 | Llama 3.1 70B | 87.4 | Meta | Open Source | 128k | — |
| 9 | Mixtral 8x22B | 85.8 | Mistral AI | Open Source | 64k | — |
| 10 | DeepSeek v2.5 | 84.2 | DeepSeek | Open Source | 128k | — |
## Key Insights

*What the data tells us*

### Claude excels at citations
Claude 3.5 Sonnet posts one of the highest RAG quality scores on the leaderboard (95.8), demonstrating an exceptional ability to stay grounded in source material while still providing comprehensive answers.
### Context length varies
While some models support 200K+ tokens, most RAG applications perform best with focused context. Longer context windows enable more flexible chunking strategies but aren't always necessary.
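To make the chunking point concrete, here is a minimal Python sketch of a fixed-size chunking strategy with overlap; the `chunk_size` and `overlap` values are illustrative defaults, not settings used in this benchmark.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping fixed-size chunks so each retrieved
    passage fits comfortably inside the model's context window."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # the overlap preserves continuity across chunk boundaries
    return chunks
```

A larger context window mostly relaxes these parameters: you can afford bigger chunks and more of them, but retrieval quality still determines what ends up in the prompt.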
### Instruction following matters
Top performers excel at following RAG-specific instructions like "only use the provided context" and "cite your sources." This prevents hallucinations and ensures answer quality.
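As an illustration of what those instructions look like in practice, here is one way a grounded RAG prompt can be assembled in Python; the exact wording below is an assumption for illustration, not the prompt used in our evaluation.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded-answer prompt: the model is told to rely only on
    the numbered passages and to cite them by index."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the context below. "
        "Cite the passages you rely on as [1], [2], etc. "
        "If the context does not contain the answer, say that it does not.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Models that follow these constraints reliably are the ones that score well on faithfulness and citation quality.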
## Methodology

*How we rank LLMs for RAG*
Quality scores combine answer faithfulness (staying true to retrieved sources) and relevance (addressing the query effectively). We evaluate models on their ability to generate accurate, well-cited responses using retrieved context. Scores are normalized to a 0-100 scale where 100 represents the best performance.
### Testing Process
Each LLM is tested on question-answering tasks in which it must use retrieved documents to generate accurate answers. We evaluate citation accuracy, answer completeness, and adherence to source material.
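A schematic of how such a test harness can be structured is sketched below; `generate_answer` and `judge` are hypothetical callables standing in for the model API and the metric scorers, and `build_rag_prompt` refers to the prompt sketch above.

```python
def evaluate_model(generate_answer, judge, dataset: list[dict]) -> dict[str, float]:
    """Run the QA-over-retrieved-documents protocol for one model and average
    the per-example metrics (each on a 0-1 scale)."""
    totals = {"faithfulness": 0.0, "relevance": 0.0, "citation_quality": 0.0}
    for example in dataset:
        prompt = build_rag_prompt(example["question"], example["passages"])
        answer = generate_answer(prompt)   # call into the model under test
        scores = judge(answer, example)    # returns the three metrics for this example
        for metric in totals:
            totals[metric] += scores[metric]
    return {metric: total / len(dataset) for metric, total in totals.items()}
```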
### Evaluation Metrics
Primary metrics include faithfulness (percentage of answer verifiable from sources), relevance (how well the answer addresses the query), and citation quality (accuracy of source attribution).
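To show how the faithfulness metric can be operationalized, here is a simplified sketch that treats faithfulness as the fraction of answer claims a support checker can verify; the substring checker below is a toy stand-in for the stronger judges used in real evaluations.

```python
def faithfulness_score(answer_claims: list[str], is_supported) -> float:
    """Faithfulness as the fraction of claims in the answer that a support
    checker can verify against the retrieved sources (1.0 = fully grounded)."""
    if not answer_claims:
        return 0.0
    supported = sum(1 for claim in answer_claims if is_supported(claim))
    return supported / len(answer_claims)

# Toy usage with a hypothetical checker that looks for exact substring support:
sources = "The 2024 report lists revenue of $12M and headcount of 85."
checker = lambda claim: claim in sources
print(faithfulness_score(["revenue of $12M", "headcount of 85", "profit of $3M"], checker))  # ~0.67
```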
### Score Calculation
Scores are normalized relative to the best-performing model in our tests. A score of 100 represents peak RAG performance, while lower scores indicate proportional decreases in answer quality or faithfulness.
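A minimal sketch of that rescaling step, assuming raw metric scores fall on a 0-1 scale; the `reference` anchor (a theoretical maximum versus the best observed raw score) is left as a parameter because it is an implementation detail not specified here.

```python
def to_leaderboard_scale(raw_quality: float, reference: float = 1.0) -> float:
    """Map a raw 0-1 quality score onto the 0-100 leaderboard scale, where
    `reference` is the raw score treated as peak RAG performance."""
    return round(100.0 * raw_quality / reference, 1)

# Illustrative only (not actual benchmark numbers): a raw quality of 0.963
# against a reference of 1.0 lands at 96.3 on the leaderboard scale.
print(to_leaderboard_scale(0.963))  # 96.3
```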
## FAQ

*Common questions*
- **What makes an LLM good for RAG?**
  The best RAG LLMs excel at following instructions to cite sources, stay grounded in the provided context, and avoid hallucinations. They also need strong comprehension to synthesize information from multiple retrieved documents.
- **Does a longer context window always help RAG?**
  Not necessarily. While longer context windows (100K+ tokens) provide flexibility, most RAG systems perform best with focused, relevant context. What matters most is retrieval quality and how well the LLM uses the provided information.
- **How do these scores compare to general LLM benchmarks?**
  RAG performance differs from general capabilities. A model might excel at creative writing but perform poorly at RAG due to weak citation habits. Our benchmarks specifically measure RAG-relevant skills like faithfulness and source attribution.
- **Should I use an open-source or proprietary LLM for RAG?**
  Proprietary models like Claude 3.5 Sonnet and GPT-4o currently lead in RAG quality, especially for citation accuracy. Open-source options like Llama 3.1 offer strong performance with full control and lower costs, making them viable for many production use cases.