Wednesday, November 19, 2025

Gemini 3 vs GPT 5.1 for RAG

Umida Muratbekova
RAG · LLM Evals

Gemini 3 was released yesterday, so we wanted to see how it behaves inside a real RAG pipeline, not just in isolated prompt tests.

We used the same retrieval, same context, and the same evaluation flow to compare it directly with GPT 5.1.

We looked at five areas that matter for RAG: conciseness, grounding, relevance, completeness, and source usage.
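To keep the comparison fair, everything up to the final generation call is shared. Below is a minimal sketch of that kind of harness in Python; the placeholder wrappers stand in for the actual OpenAI and Gemini SDK calls, and the prompt template and model names are illustrative, not our exact setup.

```python
# Minimal sketch of a shared RAG harness: identical retrieval, identical
# context, and one prompt template for both models. The wrappers are
# placeholders; plug in the OpenAI and Gemini client calls you actually use.

from typing import Callable

def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble one prompt from the retrieved chunks; both models get it verbatim."""
    context = "\n\n".join(f"[chunk {i}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def call_gpt_5_1(prompt: str) -> str:   # placeholder wrapper
    raise NotImplementedError("wrap your OpenAI client call here")

def call_gemini_3(prompt: str) -> str:  # placeholder wrapper
    raise NotImplementedError("wrap your Gemini client call here")

MODELS: dict[str, Callable[[str], str]] = {
    "gpt-5.1": call_gpt_5_1,
    "gemini-3": call_gemini_3,
}

def compare(question: str, chunks: list[str]) -> dict[str, str]:
    """Run both models on the same retrieved context and return their answers."""
    prompt = build_rag_prompt(question, chunks)
    return {name: generate(prompt) for name, generate in MODELS.items()}
```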

Conciseness

We started by looking at conciseness: can the models ignore the noise and stick to the essential answer?

GPT 5.1: Answered the “boil eggs” question with only the essential steps, giving a short and clean method.

Gemini 3: Pulled in everything from the retrieved chunk — sous-vide, temperature variants, extra techniques — making the answer far longer than needed.

conciseness.png
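If you want to put a rough number on this difference, one crude proxy is answer length relative to a short reference answer that covers only the essentials. A sketch follows; the reference and answers are paraphrased for illustration, and a length ratio obviously cannot tell a thorough answer from a padded one.

```python
# Rough conciseness proxy: how many times longer is the model's answer than a
# short reference answer covering only the essential steps? Lower is better,
# but a length ratio penalizes verbosity only, not wrong content.

def conciseness_ratio(answer: str, reference: str) -> float:
    return len(answer.split()) / max(len(reference.split()), 1)

reference = (
    "Cover the eggs with cold water, bring to a boil, simmer 9-12 minutes, "
    "then cool them in ice water."
)
answers = {
    "short-and-clean": "Cover the eggs with cold water, boil, simmer about 10 minutes, cool.",
    "everything-in-the-chunk": (
        "There are several methods: the classic stovetop boil, sous-vide at "
        "different temperatures, steaming, and pressure cooking. For the stovetop "
        "method, cover the eggs with cold water, bring to a boil, simmer, then cool."
    ),
}
for label, answer in answers.items():
    print(label, round(conciseness_ratio(answer, reference), 2))
```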

Citations & grounding

What about when you ask something the model doesn’t know?

In our tests, both models refused to answer questions they couldn’t ground in the context, which is what you want. But when the context was messy, they added extra details in different ways:

GPT 5.1: Over-explained its refusal by citing every unrelated name or topic mentioned.

Gemini 3: Pulled in random facts it saw inside the retrieved chunks.

citations.png
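A cheap way to surface this kind of addition automatically is to flag answer sentences that share almost no vocabulary with the retrieved context. Here is a sketch using plain lexical overlap; in practice an NLI model or an LLM judge is a much stronger grounding check, and the example strings are made up.

```python
# Cheap groundedness check: flag answer sentences that share almost no
# vocabulary with the retrieved context. Lexical overlap is a weak proxy for
# entailment, but it surfaces obviously ungrounded additions for review.

import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def ungrounded_sentences(answer: str, context: str, threshold: float = 0.3) -> list[str]:
    context_vocab = _words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        vocab = _words(sentence)
        if not vocab:
            continue
        overlap = len(vocab & context_vocab) / len(vocab)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged

context = "The onboarding guide covers setting up VPN access and requesting a laptop."
answer = "The guide covers VPN setup and laptop requests. It was written by the CFO in 2016."
print(ungrounded_sentences(answer, context))  # flags the second, unsupported sentence
```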

Relevance

In this example, we asked about the symptoms of dehydration:

GPT 5.1: Pulled in a long list of unrelated medical symptoms, drifting away from the relevant retrieved content.

Gemini 3: Stayed closer to the retrieved text and ignored irrelevant symptoms.

relevance.png
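One way to quantify drift like this is to score how similar the answer is to the chunk it was supposed to use. Below is a sketch using TF-IDF cosine similarity with scikit-learn; an embedding model or an LLM judge would be more robust, and the strings are illustrative rather than our actual data.

```python
# Relevance proxy: cosine similarity between the answer and the retrieved
# chunk, using TF-IDF vectors. Higher means the answer stays closer to what
# was actually retrieved. Requires scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_score(answer: str, retrieved_chunk: str) -> float:
    vectors = TfidfVectorizer().fit_transform([answer, retrieved_chunk])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

chunk = "Common dehydration symptoms are thirst, dark urine, dizziness, and fatigue."
answers = {
    "focused": "Dehydration typically causes thirst, dark urine, dizziness, and fatigue.",
    "drifting": "Symptoms can include thirst, but also fever, rashes, joint pain, "
                "and many other conditions seen across general medicine.",
}
for label, answer in answers.items():
    print(label, round(relevance_score(answer, chunk), 2))
```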

Completeness / reasoning

When it comes to explaining a process or definition (here, the duties of an AML analyst), both models answered the question, but they differed in how tightly they stayed focused.

GPT 5.1: Included every line that mentioned any analyst keyword, adding responsibilities that were not relevant to AML.

Gemini 3: Listed the main duties from the retrieved text without adding irrelevant roles.

reasoning.png
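Completeness can be approximated by listing the key points the retrieved text actually contains and counting how many show up in the answer. A rough sketch follows; substring matching misses paraphrases, and the key points and answer below are made up for illustration.

```python
# Completeness proxy: recall of expected key points. Each key point is a short
# phrase taken from the retrieved text; we count how many the answer mentions.
# Substring matching is crude; an LLM judge handles paraphrases much better.

def completeness(answer: str, key_points: list[str]) -> float:
    answer_lower = answer.lower()
    hits = sum(1 for point in key_points if point.lower() in answer_lower)
    return hits / len(key_points)

key_points = [
    "monitor transactions",
    "investigate suspicious activity",
    "file suspicious activity reports",
]
answer = (
    "An AML analyst will monitor transactions, investigate suspicious activity, "
    "and file suspicious activity reports for regulators."
)
print(round(completeness(answer, key_points), 2))  # 1.0 when every duty is covered
```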

Source usage

How well do they use sources for each query?

GPT 5.1: Added a lot of extra details, dumping everything it found about WiFi and Bluetooth.

Gemini 3: Organized the retrieved text better and kept the comparison concise and focused. As you can see, it also answered using the correct chunks.

source.png
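Source usage can be checked in a similar spirit: estimate, per retrieved chunk, how much of its content shows up in the answer, so you can see whether a model answered from the right chunks or simply dumped everything it was given. A minimal word-overlap sketch, with made-up chunks:

```python
# Source-usage proxy: for each retrieved chunk, what fraction of its content
# words appear in the answer? High coverage on the relevant chunks and low
# coverage on the distractor chunk suggests the model used the right sources.

import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def chunk_usage(answer: str, chunks: dict[str, str]) -> dict[str, float]:
    answer_words = _words(answer)
    return {
        chunk_id: round(len(_words(text) & answer_words) / max(len(_words(text)), 1), 2)
        for chunk_id, text in chunks.items()
    }

chunks = {
    "chunk-1": "WiFi offers higher bandwidth and longer range than Bluetooth.",
    "chunk-2": "Bluetooth uses less power and pairs devices directly.",
    "chunk-3": "The office cafeteria menu changes every Monday.",
}
answer = (
    "WiFi gives you higher bandwidth and longer range, while Bluetooth "
    "uses less power and pairs devices directly."
)
print(chunk_usage(answer, chunks))
```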

What can we say about each model based on these comparisons?

| Category | Gemini 3 | GPT-5.1 | Verdict |
| --- | --- | --- | --- |
| Conciseness | Much longer; includes every detail it finds. | Shorter, sticks to core steps. Avoids extra methods. | GPT-5.1 |
| Citations & Grounding | Refuses cleanly but also over-cites. | Refuses but over-cites unrelated names/topics. | Tie |
| Relevance | Stays on-topic; ignores irrelevant symptoms. | Drifts by including unrelated medical details. | Gemini 3 |
| Completeness / Reasoning | Covers the core duties without adding irrelevant roles. | Adds unrelated analyst responsibilities; more noise. | Gemini 3 |
| Source Usage | Organizes retrieved text into a concise answer. | Dumps all details from the context; heavy output. | Gemini 3 |

In 3/5 cases, Gemini 3 stays closer to the retrieved text, keeps answers focused, and avoids drifting.

GPT 5.1 is more expressive but also more prone to pulling in extra, unnecessary details.

These comparisons show that each model leans toward a different style, so the right choice depends on the type of answers you value.