Monday, February 9, 2026

Voyage 4: Evaluation Notes

Umida Muratbekova
RAG · Embeddings

Voyage AI recently released Voyage 4. We spent some time running it through our evaluation pipeline to see how much it moves the needle for RAG compared to Voyage 3 Large and current frontier embedding models.

Here is what we found.

The Evaluation

We tested 14 models across 6 datasets, covering financial Q&A, web search, scientific claims, and entity retrieval. We used two primary lenses:

  1. Traditional metrics: NDCG@10.
  2. LLM-as-a-judge: Pairwise comparisons aggregated into Elo ratings. We had GPT-5 compare the Top-5 results from each model pair to determine which set provided better context for the query (a minimal sketch of the aggregation follows below).
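
For readers who want the mechanics: the judge's pairwise verdicts are folded into ratings with a standard Elo update. This is a minimal sketch, assuming a conventional K-factor of 32 and a 1500 starting rating; the judge call itself is represented by pre-collected verdict triples rather than an actual GPT-5 request, and the example rows are invented for illustration.

```python
from collections import defaultdict

K_FACTOR = 32        # illustrative K-factor; not necessarily what our pipeline uses
START_RATING = 1500  # conventional starting rating

def expected_score(rating_a, rating_b):
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings, model_a, model_b, outcome):
    """Apply one pairwise verdict. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K_FACTOR * (outcome - e_a)
    ratings[model_b] += K_FACTOR * ((1.0 - outcome) - (1.0 - e_a))

# Verdict triples as the LLM judge would produce them after comparing Top-5
# result sets for a query; these particular rows are made up for illustration.
judgments = [
    ("voyage-4", "voyage-3-large", 1.0),
    ("voyage-4", "text-embedding-3-large", 0.5),
    ("text-embedding-3-large", "voyage-3-large", 1.0),
]

ratings = defaultdict(lambda: START_RATING)
for model_a, model_b, outcome in judgments:
    update(ratings, model_a, model_b, outcome)

print(dict(ratings))
```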

Key Observations

Voyage 4 is currently the strongest general-purpose model we've tested.

  • Win Rate: 61.7% across 780 head-to-head comparisons.
  • Elo: 1606 (the highest in our current pool).
  • NDCG@10: 0.859.
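
For reference, NDCG@10 rewards putting highly relevant chunks near the top of the ranking. Here is a minimal sketch of the metric, using made-up relevance grades and the common simplification of computing the ideal ranking over the retrieved list only:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the returned ranking divided by the DCG of the ideal ranking.
    The ideal ranking is taken over the retrieved list only (a common simplification)."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the 10 chunks one model returned for a query (made-up grades).
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2, 0, 0, 1, 0], k=10), 3))
```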

The most interesting signal is the Top-K density. Voyage 4's advantage over its predecessors grows by 2.3× when you evaluate only the Top-5 results instead of the Top-10. It's significantly more "top-heavy" with relevant information.
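One way to quantify that top-heaviness is to compare the margin between the two models at K=5 versus K=10. The sketch below uses placeholder scores rather than our actual measurements, and it assumes the margin is expressed in NDCG terms:

```python
# Placeholder NDCG scores; the real numbers come from the evaluation runs.
scores = {
    "voyage-4":       {"ndcg@5": 0.85, "ndcg@10": 0.86},
    "voyage-3-large": {"ndcg@5": 0.78, "ndcg@10": 0.83},
}

margin_at_5 = scores["voyage-4"]["ndcg@5"] - scores["voyage-3-large"]["ndcg@5"]
margin_at_10 = scores["voyage-4"]["ndcg@10"] - scores["voyage-3-large"]["ndcg@10"]

# A ratio above 1 means the advantage is concentrated near the top of the ranking.
print(f"Top-5 margin is {margin_at_5 / margin_at_10:.1f}x the Top-10 margin")
```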

Domain Performance: Factual vs. Reasoning

[Figure: domain performance by dataset — voyage-plot1.png]

The model isn't a linear upgrade across every domain. It has a very specific signature.

  • Entity-Rich & Business Documents: The model is strongest here. It performed exceptionally well on DBPedia and Business Reports, making it ideal for finding specific data points in professional documents.
  • The SciFact Ceiling: This was the only outlier. Voyage 4 dropped to 1465 Elo with a 90% tie rate. It seems to hit a wall on tasks requiring subtle scientific reasoning or evidence-contradiction analysis. It behaved like a high-end semantic matcher, but not a "reasoner."

The OpenAI Parity

We found a perfect stalemate between Voyage 4 and OpenAI's text-embedding-3-large.

  • Record: 28 wins – 28 losses – 4 ties. Neither model could establish dominance over the other across 60 queries. They currently sit in an elite tier together, roughly 60+ Elo points above the rest of the field.
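
To put that 60-point gap in perspective, the standard Elo model translates a rating difference into an expected head-to-head score (this is the textbook formula, not anything specific to our pipeline):

```python
def expected_win_rate(elo_gap):
    """Expected head-to-head score of the higher-rated model under the Elo model."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

print(expected_win_rate(60))  # ~0.586: a 60-point lead implies roughly a 59% expected score
print(expected_win_rate(0))   # 0.5: the parity we observe between Voyage 4 and text-embedding-3-large
```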

Retrieval Quality in RAG

[Figure: Top-5 retrieval quality comparison — voyage-plot2.png]

Most RAG systems are optimized for Top-3 or Top-5 chunks to reduce noise and context costs. This is where Voyage 4 makes the most sense.

We looked at a specific query regarding post-2000 gold economics:

  • Voyage 4: Returned 5/5 chunks directly addressing gold price drivers (QE3, real interest rates, financial crisis).
  • Voyage 3 Large: Returned 0/5 relevant chunks. It surfaced "finance-adjacent" noise: Hyundai cruise control pricing and IRS tax procedures.

The semantic overlap in Voyage 3 Large was too broad. Voyage 4 surfaced answerable context where the predecessor only found topical keywords.
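
The retrieval step being compared here is straightforward: embed the query and the candidate chunks, rank the chunks by cosine similarity, and keep the Top-5. Below is a minimal sketch with a stand-in embed() function; the real pipeline calls each provider's embedding API.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for the embedding provider's API (Voyage, OpenAI, ...).
    Should return one vector per input text; wire it to the model under test."""
    raise NotImplementedError("call your embedding provider here")

def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank candidate chunks by cosine similarity to the query and keep the top k."""
    vectors = embed([query] + chunks)
    q, c = vectors[0], vectors[1:]
    # Cosine similarity = dot product of L2-normalized vectors.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    order = np.argsort(c @ q)[::-1][:k]
    return [chunks[i] for i in order]
```

Everything in the gold-economics example above comes down to how well the vectors returned by that embedding step separate answerable context from merely topical noise.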

Conclusion

Voyage 4 is currently the strongest general-purpose model on our leaderboard. If your RAG system relies on factual retrieval and you are optimizing for a small context window (Top-5), this should be the new default.

Full dataset breakdowns and rankings are available on the embeddings leaderboard.