Friday, March 27, 2026

RAG for the AI SDK

Umida Muratbekova

You're building with the AI SDK and you want the model to answer questions from your own documents. The way to do this is called RAG, or retrieval-augmented generation.

Instead of relying on what the model was trained on, you retrieve relevant pieces of your data at query time and feed them to the model as context.

There are two phases:

Ingestion: you parse documents, split them into chunks, turn those chunks into vectors, and store everything in a database.

Retrieval: you embed the user's question, search for matching chunks, and return them as context.

Ingestion: prepare your data

Parsing and chunking happen outside the SDK. Once you have chunks, the AI SDK turns them into vectors. embedMany embeds multiple chunks in one call:

import { embedMany } from 'ai';
import { openai } from '@ai-sdk/openai';

const embeddingModel = openai.embedding('text-embedding-3-small');

const { embeddings } = await embedMany({
  model: embeddingModel,
  values: chunks,
});

Store these vectors in a vector database alongside the original text. Retrieval needs both: the vector for similarity search, the text to send to the model.
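The chunking step mentioned above can be as simple as a sliding window over the text. A minimal sketch, where the chunk size and overlap values are illustrative, not SDK defaults:

```typescript
// Sliding-window chunker: fixed character windows with overlap so that
// sentences cut at a boundary still appear whole in a neighboring chunk.
// chunkSize and overlap are illustrative values; tune them for your data.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Real pipelines usually chunk on structural boundaries (paragraphs, headings) rather than raw character counts, but the windowing idea is the same.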

Retrieval: search at query time

For single queries, you use embed instead of embedMany. Same model, same vector space — that's what makes similarity search work:

import { embed } from 'ai';

async function findRelevantContent(query: string) {
  const { embedding } = await embed({
    model: embeddingModel,
    value: query,
  });
  // search vector store, return matching chunks
}

Search your vector database with that embedding, pull the closest matches, return the text.
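A minimal sketch of that store-and-search step, using an in-memory array in place of a real vector database (the record shape and function names here are illustrative):

```typescript
// Each record pairs an embedding with the original chunk text,
// mirroring what a vector database row would hold.
type StoredChunk = { embedding: number[]; text: string };

const store: StoredChunk[] = [];

// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored chunks closest to the query embedding.
function topK(queryEmbedding: number[], k: number): StoredChunk[] {
  return [...store]
    .sort(
      (x, y) =>
        cosineSimilarity(y.embedding, queryEmbedding) -
        cosineSimilarity(x.embedding, queryEmbedding),
    )
    .slice(0, k);
}
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, but the interface is the same: embedding in, closest chunks out.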

The AI SDK connects this to the model through tools. You define search as a tool, and the model calls it when it needs information:

import { streamText, tool, stepCountIs } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

streamText({
  model: openai('gpt-4o'),
  tools: {
    search: tool({
      description: 'Search the knowledge base',
      inputSchema: z.object({ query: z.string() }),
      execute: async ({ query }) => findRelevantContent(query),
    }),
  },
  stopWhen: stepCountIs(3),
});

stopWhen: stepCountIs(3) caps the generation at three steps. A step is one model generation: the first calls the tool, the next reads the results and responds. The cap prevents infinite tool-call loops and keeps API costs bounded.

What's Next

This post covers a baseline RAG implementation with the AI SDK. From here, most systems add reranking to improve precision. If you want to see the top rerankers, we benchmarked them here. Another common step is hybrid search: combining vector similarity with keyword matching to catch exact terms that embeddings miss.
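One common way to combine the two rankings in hybrid search is reciprocal rank fusion. A sketch, where the input lists are document ids ordered best-first (one from vector search, one from keyword search) and 60 is the conventional RRF smoothing constant:

```typescript
// Reciprocal rank fusion: each document's score is the sum of
// 1 / (k + rank + 1) across every ranking it appears in, so documents
// ranked highly by either retriever rise to the top of the fused list.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem that cosine similarities and keyword scores live on incompatible scales.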