Gemini 2.5 Pro vs GPT-OSS 120B
Detailed comparison between Gemini 2.5 Pro and GPT-OSS 120B for RAG applications. See which LLM best meets your accuracy, performance, and cost needs.
Model Comparison
Gemini 2.5 Pro takes the lead.
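Both Gemini 2.5 Pro and GPT-OSS 120B are capable language models for RAG applications, but they trade off differently: Gemini 2.5 Pro ranks higher on ELO rating and win rate, while GPT-OSS 120B is far cheaper per token and faster on average. Their overall quality scores are nearly identical (4.88 vs. 4.85).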
Why Gemini 2.5 Pro:
- Gemini 2.5 Pro has a 113-point higher ELO rating (1429 vs. 1316)
- Gemini 2.5 Pro has a win rate 16.5 percentage points higher (35.4% vs. 18.9%)
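For intuition about what a 113-point gap means, the standard Elo expected-score formula can be applied to the two ratings. This is only a sketch: it assumes the conventional 400-point logistic scale, the benchmark's actual rating procedure is not described on this page, and `expected_score` is an illustrative helper name rather than anything the benchmark publishes.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B.

    Assumes the conventional 400-point scale; a tie counts as half a win.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the comparison table on this page.
GEMINI_ELO = 1429
GPT_OSS_ELO = 1316

p_gemini = expected_score(GEMINI_ELO, GPT_OSS_ELO)
print(f"Expected score, Gemini 2.5 Pro vs GPT-OSS 120B: {p_gemini:.2f}")
```

Under these assumptions the expected score comes out at roughly 0.66 in Gemini 2.5 Pro's favor. The reported win rates are measured against all models in the benchmark, so they are not directly comparable to this head-to-head figure.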
Overview
Key metrics
- ELO Rating: overall ranking quality
- Win Rate: head-to-head performance
- Quality Score: overall quality metric
- Average Latency: response time
Visual Performance Analysis
Performance
[Charts: ELO Rating Comparison; Win/Loss/Tie Breakdown; Quality Across Datasets (Overall Score); Latency Distribution (ms)]
Breakdown
How the models stack up
| Metric | Gemini 2.5 Pro | GPT-OSS 120B | Description |
|---|---|---|---|
| Overall Performance | |||
| ELO Rating | 1429 | 1316 | Overall ranking quality based on pairwise comparisons |
| Win Rate | 35.4% | 18.9% | Percentage of comparisons won against other models |
| Quality Score | 4.88 | 4.85 | Average quality across all RAG metrics |
| Pricing & Context | |||
| Input Price per 1M | $1.25 | $0.04 | Cost per million input tokens |
| Output Price per 1M | $10.00 | $0.19 | Cost per million output tokens |
| Context Window | 1049K | 131K | Maximum context window size (tokens) |
| Release Date | 2025-06-17 | 2025-08-05 | Model release date |
| Performance Metrics | |||
| Avg Latency | 15.2s | 11.2s | Average response time across all datasets |
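The per-million-token prices above translate into a per-query cost once token counts are known. The sketch below shows the arithmetic; the workload of 4,000 input tokens and 500 output tokens is a hypothetical example, not a figure from the benchmark, and `Pricing` and `query_cost` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices copied from the table above.
PRICES = {
    "Gemini 2.5 Pro": Pricing(input_per_million=1.25, output_per_million=10.00),
    "GPT-OSS 120B": Pricing(input_per_million=0.04, output_per_million=0.19),
}


def query_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request with the given token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Hypothetical RAG request: retrieved context plus question in, short answer out.
INPUT_TOKENS = 4_000
OUTPUT_TOKENS = 500

for model, pricing in PRICES.items():
    cost = query_cost(pricing, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:.6f} per query")
```

For this assumed workload the result is about $0.01 per query for Gemini 2.5 Pro and about $0.000255 for GPT-OSS 120B. Real costs depend on how much retrieved context each request carries and how long the answers are.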
Dataset Performance
By benchmark
Comprehensive comparison of RAG quality metrics (correctness, faithfulness, grounding, relevance, completeness) and latency for each benchmark dataset.
MSMARCO
| Metric | Gemini 2.5 Pro | GPT-OSS 120B | Description |
|---|---|---|---|
| Quality Metrics | |||
| Correctness | 4.90 | 4.93 | Factual accuracy of responses |
| Faithfulness | 4.93 | 4.90 | Adherence to source material |
| Grounding | 4.93 | 4.90 | Citations and context usage |
| Relevance | 5.00 | 4.97 | Query alignment and focus |
| Completeness | 4.90 | 4.87 | Coverage of all aspects |
| Overall | 4.93 | 4.91 | Average across all metrics |
| Latency Metrics | |||
| Mean | 12449ms | 5616ms | Average response time |
| Min | 7629ms | 1255ms | Fastest response time |
| Max | 23066ms | 20330ms | Slowest response time |
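The "Overall" row is described as the average across all metrics. As a quick check, the sketch below takes the unweighted mean of the five MSMARCO quality scores for each model. It assumes equal weighting and rounding to two decimals, which the page does not state explicitly; `overall_score` is an illustrative name.

```python
from statistics import mean

# MSMARCO quality scores copied from the table above.
MSMARCO_SCORES = {
    "Gemini 2.5 Pro": {
        "correctness": 4.90,
        "faithfulness": 4.93,
        "grounding": 4.93,
        "relevance": 5.00,
        "completeness": 4.90,
    },
    "GPT-OSS 120B": {
        "correctness": 4.93,
        "faithfulness": 4.90,
        "grounding": 4.90,
        "relevance": 4.97,
        "completeness": 4.87,
    },
}


def overall_score(metrics: dict) -> float:
    """Unweighted mean of the quality metrics, rounded to two decimals."""
    return round(mean(list(metrics.values())), 2)


for model, metrics in MSMARCO_SCORES.items():
    print(f"{model}: {overall_score(metrics):.2f}")
```

With equal weights this reproduces the reported overall scores of 4.93 and 4.91.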
PG
| Metric | Gemini 2.5 Pro | GPT-OSS 120B | Description |
|---|---|---|---|
| Quality Metrics | |||
| Correctness | 5.00 | 4.80 | Factual accuracy of responses |
| Faithfulness | 5.00 | 4.80 | Adherence to source material |
| Grounding | 5.00 | 4.80 | Citations and context usage |
| Relevance | 5.00 | 4.83 | Query alignment and focus |
| Completeness | 5.00 | 4.73 | Coverage of all aspects |
| Overall | 5.00 | 4.79 | Average across all metrics |
| Latency Metrics | |||
| Mean | 17834ms | 19128ms | Average response time |
| Min | 11067ms | 1317ms | Fastest response time |
| Max | 49308ms | 69491ms | Slowest response time |
SciFact
| Metric | Gemini 2.5 Pro | GPT-OSS 120B | Description |
|---|---|---|---|
| Quality Metrics | |||
| Correctness | 4.73 | 4.87 | Factual accuracy of responses |
| Faithfulness | 4.80 | 4.87 | Adherence to source material |
| Grounding | 4.80 | 4.87 | Citations and context usage |
| Relevance | 4.73 | 4.80 | Query alignment and focus |
| Completeness | 4.57 | 4.70 | Coverage of all aspects |
| Overall | 4.73 | 4.82 | Average across all metrics |
| Latency Metrics | |||
| Mean | 15314ms | 8854ms | Average response time |
| Min | 8817ms | 0ms | Fastest response time |
| Max | 35365ms | 35709ms | Slowest response time |
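The headline "Avg Latency" figures appear to be the unweighted mean of the three per-dataset mean latencies. The sketch below checks that reading; it assumes each dataset counts equally, which the page does not confirm, and `average_latency_seconds` is an illustrative name.

```python
from statistics import mean

# Per-dataset mean latencies in milliseconds, copied from the tables above.
MEAN_LATENCY_MS = {
    "Gemini 2.5 Pro": {"MSMARCO": 12449, "PG": 17834, "SciFact": 15314},
    "GPT-OSS 120B": {"MSMARCO": 5616, "PG": 19128, "SciFact": 8854},
}


def average_latency_seconds(per_dataset_ms: dict) -> float:
    """Unweighted mean of per-dataset mean latencies, in seconds (one decimal)."""
    return round(mean(list(per_dataset_ms.values())) / 1000, 1)


for model, per_dataset in MEAN_LATENCY_MS.items():
    print(f"{model}: {average_latency_seconds(per_dataset)}s")
```

Under this assumption the results match the reported 15.2s and 11.2s. The average hides a per-dataset difference: GPT-OSS 120B is faster on MSMARCO and SciFact but slightly slower on PG.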