Reranker Leaderboard

Performance comparison of the top rerankers for Retrieval-Augmented Generation (RAG), tested on diverse datasets.

Last updated: November 4, 2025

| Rank | Model | ELO Score | nDCG | Latency (ms) | Price | License |
|---|---|---|---|---|---|---|
| 🥇 1 | Zerank 1 | 1642 | 0.676 | 1126 | $0.025 | cc-by-nc-4.0 |
| 🥈 2 | Voyage AI Rerank 2.5 | 1629 | 0.680 | 610 | $0.050 | Proprietary |
| 🥉 3 | Contextual AI Rerank v2 Instruct | 1550 | 0.687 | 3010 | $0.050 | cc-by-nc-4.0 |
| 4 | Voyage AI Rerank 2.5 Lite | 1510 | 0.679 | 607 | $0.020 | Proprietary |
| 5 | BAAI/BGE Reranker v2 M3 | 1468 | 0.686 | 1891 | $0.020 | Apache 2.0 |
| 6 | Zerank 1 Small | 1458 | 0.676 | 1109 | $0.025 | Apache 2.0 |
| 7 | Cohere Rerank 3.5 | 1403 | 0.689 | 492 | $0.050 | Proprietary |
| 8 | Jina Reranker v2 Base Multilingual | 1335 | 0.671 | 1411 | $0.045 | cc-by-nc-4.0 |

Overview

Our Recommendation

We recommend Voyage AI Rerank 2.5 as the best overall reranker for production use.

Highest Win Rate

Wins more head-to-head matchups than any other model across all benchmarks.

Superior Speed

Runs faster than other top rerankers while keeping accuracy high, making it well suited to production workloads.

Strong Accuracy

Delivers high nDCG and Recall scores. Surfaces the right context without missing key details.

Methodology

How We Evaluate Rerankers

The Reranker Leaderboard tests models on three datasets — financial queries, scientific claims, and essay-style content — to see how well they adapt to different retrieval patterns in RAG pipelines.

Testing Process

Each reranker is scored on the same top-50 documents retrieved with FAISS, so every model sees identical candidates. We measure both ranking quality and latency, capturing the real-world balance between accuracy and speed.
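
As a rough illustration of this setup, here is a minimal sketch, assuming a sentence-transformers embedding model for first-stage retrieval and a generic `rerank_fn` callable wrapping whichever reranker API is under test; the model and function names are illustrative, not the leaderboard's actual harness.

```python
import time
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedder; the leaderboard's exact retrieval model is not specified here.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(corpus: list[str]) -> faiss.IndexFlatIP:
    """Embed the corpus and store it in a flat inner-product FAISS index."""
    vecs = embedder.encode(corpus, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def retrieve_top_k(index, corpus: list[str], query: str, k: int = 50) -> list[str]:
    """First-stage retrieval: the same top-50 candidates are fed to every reranker."""
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [corpus[i] for i in ids[0]]

def timed_rerank(rerank_fn, query: str, candidates: list[str]):
    """Run one reranker on the shared candidates and record wall-clock latency."""
    start = time.perf_counter()
    ranked = rerank_fn(query, candidates)   # hypothetical callable wrapping a reranker API
    latency_ms = (time.perf_counter() - start) * 1000
    return ranked, latency_ms
```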

ELO Score

For each query, GPT-5 compares two ranked lists and picks the more relevant one. Wins and losses feed into an ELO rating — higher scores mean more consistent wins.
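
For reference, this is how a standard ELO update turns those pairwise judgments into ratings; the K-factor and starting rating below are illustrative assumptions, not the leaderboard's exact parameters.

```python
# Standard ELO update applied to pairwise wins/losses.
# K-factor and the 1500 starting rating are illustrative assumptions.
K = 32
ratings: dict[str, float] = {}   # model name -> current rating

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(model_a: str, model_b: str, a_won: bool) -> None:
    """Apply one judged matchup: the LLM judge preferred model_a or model_b."""
    r_a = ratings.setdefault(model_a, 1500.0)
    r_b = ratings.setdefault(model_b, 1500.0)
    e_a = expected(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    ratings[model_a] = r_a + K * (score_a - e_a)
    ratings[model_b] = r_b + K * ((1.0 - score_a) - (1.0 - e_a))
```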

Evaluation Metrics

We measure nDCG@5/10 for ranking precision and Recall@5/10 for coverage. Together, they show how well a reranker surfaces relevant results at the top.
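
A minimal binary-relevance sketch of both metrics (datasets with graded relevance labels would use the full DCG gain term instead of 0/1 gains):

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k: discounted gain of the ranking vs. the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)
```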

Common questions

Reranker FAQ

What is a reranker?
A reranker refines an initial list of retrieved results by reordering them so the most relevant documents appear first. Unlike first-stage retrieval models, which embed queries and documents separately, a reranker scores each query-document pair directly, improving search precision and ranking quality.
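
As an illustration, here is a minimal sketch using an open cross-encoder from the sentence-transformers library to reorder retrieved passages; the model name is only an example and is not one of the leaderboard entries.

```python
from sentence_transformers import CrossEncoder

# Example open cross-encoder; any reranker exposes the same query-document scoring idea.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, passage) pair jointly and return passages best-first."""
    scores = reranker.predict([(query, p) for p in passages])
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in order[:top_k]]
```
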
Why do I need a reranker for RAG?
Rerankers make Retrieval-Augmented Generation (RAG) systems more accurate. They ensure your LLM receives the most relevant context, leading to better-grounded answers — especially when your knowledge base is large or overlapping.
How much do rerankers improve results?
In our benchmarks, rerankers improved retrieval accuracy by 15–40% compared to semantic search alone. That means cleaner context, fewer hallucinations, and more reliable RAG performance.
Why use ELO scoring for ranking?
ELO scoring measures how often one model outperforms another in direct comparisons. It reflects real-world consistency better than isolated metrics — a higher ELO means the model wins more head-to-head matchups across diverse queries.
Which datasets are used for evaluation?
We benchmark rerankers on three datasets — FiQA (finance), SciFact (science), and PG (long-form content). PG doesn’t include labeled relevance data, so it’s evaluated only with ELO-based LLM judgments, not traditional metrics like nDCG or Recall.
Should I use an open-source or proprietary reranker?
Open-source rerankers like Jina v2 offer great performance and full control for self-hosting. Proprietary options like Cohere provide slightly better accuracy and managed infrastructure. Choose based on your accuracy requirements and deployment preferences.