Building production RAG systems with Next.js, LangChain, and the Vercel AI SDK

"Just add ChatGPT to your docs" has become shorthand for a whole family of features: internal knowledge search, customer support bots, technical assistants. Most of them are actually RAG systems. Getting a prototype running is easy. Getting one that answers correctly, consistently, and cheaply in production is where most teams get stuck. This post is the stack we would use today to build that for a client, with code-shaped explanations rather than a full repo walkthrough.

When RAG is the right tool

Before you build one, be honest about whether you need it. RAG is the right choice when:

Your knowledge changes frequently, like docs, tickets, or policies.
You need provenance, meaning you can show which source the answer came from.
The corpus is too big to fit in a single prompt. Even with a 1M context window, you still pay for every token.

If your corpus is small and stable, stuff it into the system prompt and use prompt caching. If it is huge but you only care about one domain, fine-tuning might be cheaper per query. RAG wins when the corpus is both large and churning.

The stack

Next.js App Router. Server actions for ingestion, route handlers for chat.
LangChain (JS). Document loaders, chunkers, and retrievers. We use LangChain for the ingestion pipeline, not for the chat loop.
Vercel AI SDK. Streaming, structured outputs, and a clean hook for the UI. This is what renders tokens to the user.
pgvector on Postgres. Vector store. Cheaper and simpler than a dedicated vector DB for most use cases.
OpenAI or Anthropic for embeddings and generation.

Splitting responsibilities this way is intentional. LangChain is great at ingestion orchestration. The Vercel AI SDK is great at streaming and UI integration. Combining them avoids the biggest pain points of each.

Ingestion: document to chunks to embeddings to pgvector

app/actions/ingest.ts

typescript

"use server";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "@langchain/openai";
import { sql } from "@/lib/db";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120,
});

const embedder = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

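export async function ingest(docId: string, text: string) {
  const chunks = await splitter.splitText(text);
  // One embedding per chunk, returned in the same order as the input.
  const vectors = await embedder.embedDocuments(chunks);

  // Delete-then-insert keeps re-ingestion idempotent for a given doc_id.
  // Wrap both steps in a transaction if your Postgres client supports it.
  await sql`delete from doc_chunks where doc_id = ${docId}`;

  for (let i = 0; i < chunks.length; i++) {
    // pgvector accepts the "[0.1,0.2,...]" text form that JSON.stringify produces.
    await sql`
      insert into doc_chunks (doc_id, chunk_index, content, embedding)
      values (${docId}, ${i}, ${chunks[i]}, ${JSON.stringify(vectors[i])}::vector)
    `;
  }
}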

Three things are load-bearing here.

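Chunk size 600 to 1000 tokens, 10 to 20% overlap. Bigger chunks lose precision in retrieval. Smaller chunks lose context. This range works for most prose. Note that RecursiveCharacterTextSplitter measures length in characters by default, so chunkSize: 800 in the code above means 800 characters (very roughly 200 tokens of English). Raise the numbers or pass a token-counting lengthFunction if you want token-sized chunks.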
Store the source metadata. Title, URL, section heading. Whatever you will want to show the user as a citation.
Idempotency. Re-ingesting should update, not duplicate. Delete-then-insert by doc_id is simpler than a full diff.
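
To make the size and overlap trade-off concrete, here is a minimal sketch of a fixed-window chunker. It is not what RecursiveCharacterTextSplitter does internally (that splitter prefers paragraph and sentence boundaries). The function name chunkWithOverlap is hypothetical, and whitespace-separated words stand in for tokens, which is an assumption made only to keep the example self-contained.

```typescript
// Hypothetical helper: fixed-size windows over whitespace-separated words.
// Words stand in for tokens here; a real tokenizer would give different counts.
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = size - overlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    // Stop once a window reaches the end, so no tail-only duplicate is emitted.
    if (start + size >= words.length) break;
  }
  return chunks;
}

// 10 words, windows of 4 with 1 word of overlap.
const demoChunks = chunkWithOverlap("a b c d e f g h i j", 4, 1);
console.log(demoChunks); // ["a b c d", "d e f g", "g h i j"]
```

Each window repeats the last word of the previous one, which is the overlap doing its job: a sentence that straddles a boundary still appears whole in at least one chunk.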

Retrieval: similarity and filtering

lib/retrieve.ts

typescript

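import { OpenAIEmbeddings } from "@langchain/openai";
import { sql } from "@/lib/db";

// Must be the same embedding model that was used at ingestion time.
const embedder = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

export async function retrieve(query: string, k = 6) {
  // embedQuery is the single-string counterpart of embedDocuments.
  const queryVec = JSON.stringify(await embedder.embedQuery(query));
  // <=> is pgvector's cosine distance, so 1 - distance is cosine similarity.
  const rows = await sql`
    select content, doc_id, 1 - (embedding <=> ${queryVec}::vector) as score
    from doc_chunks
    order by embedding <=> ${queryVec}::vector
    limit ${k}
  `;
  // Drop weak matches so the caller can say "I don't know" instead of guessing.
  return rows.filter((r) => r.score > 0.3);
}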

A few retrieval tricks that repeatedly helped us:

Score threshold, not just top-k. If nothing clears the bar, tell the user you do not know. That is better than hallucinating on bad matches.
Hybrid search. Combine vector similarity with Postgres full-text (ts_rank) for keyword-heavy queries like error codes or proper names.
Rerank the top 20. A cheap cross-encoder reranker, or a small LLM call, on the top 20 vector hits boosts answer quality more than any prompt tweak.
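
The post does not say how to combine the vector and full-text result lists for hybrid search. One common option is reciprocal rank fusion, sketched below as an assumption rather than as the method used here. The function name reciprocalRankFusion and the sample chunk IDs are hypothetical, and k = 60 is a conventional default, not a tuned value.

```typescript
// Hypothetical helper: merge ranked ID lists with reciprocal rank fusion.
// Each list is ordered best-first. An ID's fused score is the sum of
// 1 / (k + rank) over every list it appears in.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

const vectorHits = ["c1", "c2", "c3"]; // e.g. ordered by the <=> query
const keywordHits = ["c3", "c1", "c4"]; // e.g. ordered by ts_rank
console.log(reciprocalRankFusion([vectorHits, keywordHits]));
// ["c1", "c3", "c2", "c4"]
```

Because it works on ranks rather than raw scores, this approach sidesteps the problem that cosine similarity and ts_rank are on different scales.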

Generation: streaming with the Vercel AI SDK

app/api/chat/route.ts

typescript

import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import { retrieve } from "@/lib/retrieve";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const latest = messages[messages.length - 1].content;
  const hits = await retrieve(latest);

  const context = hits.map((h, i) =>
    `[${i + 1}] ${h.content}`
  ).join("\n\n");

  const result = streamText({
    model: openai("gpt-4o-mini"),
    system: `Answer using only the context below. Cite sources as [n].
Context:
${context}`,
    messages,
  });

  return result.toDataStreamResponse();
}

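On the client, useChat gives you streaming tokens and message state with one hook. It is exported from ai/react in older SDK releases and from @ai-sdk/react in newer ones. streamText on the server and useChat on the client is the pair we reach for every time.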
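
The route handler above always calls the model, even when retrieval returns nothing. A minimal sketch of one way to handle that case, and to keep a lookup from each [n] marker back to its doc_id, is shown here. The helper name buildPrompt, the Hit type, and the sample data are all hypothetical.

```typescript
type Hit = { content: string; doc_id: string; score: number };

// Hypothetical helper: build the system prompt plus a citation lookup table.
// Returns null when there is nothing to ground the answer on.
function buildPrompt(
  hits: Hit[],
): { system: string; sources: Record<number, string> } | null {
  if (hits.length === 0) return null;
  const sources: Record<number, string> = {};
  const context = hits
    .map((h, i) => {
      sources[i + 1] = h.doc_id; // [n] in the answer maps back to this doc_id
      return `[${i + 1}] ${h.content}`;
    })
    .join("\n\n");
  const system =
    "Answer using only the context below. Cite sources as [n]. " +
    "If the context does not contain the answer, say you do not know.\n" +
    `Context:\n${context}`;
  return { system, sources };
}

// Made-up sample hit, for illustration only.
const built = buildPrompt([
  { content: "Refunds take 5 days.", doc_id: "policy-7", score: 0.82 },
]);
console.log(built ? built.sources : "no context, answer with a fallback");
```

In the route handler, a null result would be the cue to return a fixed "I don't know" response instead of calling streamText.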

Evaluation, the step most teams skip

A RAG system is only as good as its test set. Build a golden set of 50 to 100 questions with known-good answers, and run it every time you change the prompt, the chunker, or the retrieval strategy. Three metrics are enough to start.

Retrieval hit rate. Did the right chunk appear in the top k?
Faithfulness. Does the answer only use retrieved content? Grade it with another LLM call.
Answer quality. Human-rated on a small sample, or LLM-as-judge on a larger one.
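
Retrieval hit rate is the easiest of the three to automate. The sketch below assumes you have already run each golden question through retrieval and recorded the returned doc_ids. The EvalResult type and retrievalHitRate name are hypothetical, and matching on doc_id rather than on the exact chunk is a simplification.

```typescript
type EvalResult = {
  question: string;
  expectedDocId: string; // the document a correct answer should come from
  retrievedDocIds: string[]; // doc_ids of the top-k chunks actually returned
};

// Hypothetical metric: fraction of questions whose expected document
// appears anywhere in the retrieved top k.
function retrievalHitRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const hits = results.filter(
    (r) => r.retrievedDocIds.indexOf(r.expectedDocId) !== -1,
  ).length;
  return hits / results.length;
}

// Made-up results for two golden questions: one hit, one miss.
const sampleRun: EvalResult[] = [
  { question: "How long do refunds take?", expectedDocId: "policy-7", retrievedDocIds: ["policy-7", "faq-2"] },
  { question: "What is error E42?", expectedDocId: "errors-1", retrievedDocIds: ["faq-2", "faq-9"] },
];
console.log(retrievalHitRate(sampleRun)); // 0.5
```

Tracking this number per change to the prompt, chunker, or retrieval strategy separates retrieval regressions from generation regressions.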

Things that bit us

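Token budgets creep. Six chunks of 800 tokens is already about 4,800 tokens of context before you add the system prompt and the conversation history. Budget early.
Streaming with edge functions. Works great until you need long generations with connection retries. For anything beyond 30 seconds, run the route handler on the Node.js runtime (export const runtime = "nodejs") rather than edge.
pgvector indexes. Without an index, every query is a sequential scan, so latency grows linearly with the number of rows. Add an HNSW index using vector_cosine_ops, which matches the <=> operator used in retrieval, and create it after your initial load, not before, because building the index once over loaded data is faster than maintaining it during bulk inserts.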
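
One way to budget early is to cap the retrieved context before it reaches the prompt. The sketch below is an illustration under stated assumptions: the four-characters-per-token estimate is a rough rule of thumb for English, not a real tokenizer, and the names estimateTokens and fitToBudget are hypothetical.

```typescript
type Chunk = { content: string; score: number };

// Rough estimate: about 4 characters per token for English text (an assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: keep the best-scoring chunks that fit in the budget.
function fitToBudget(chunks: Chunk[], maxTokens: number): Chunk[] {
  // Copy before sorting so the caller's array is left untouched.
  const ranked = chunks.slice().sort((a, b) => b.score - a.score);
  const kept: Chunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxTokens) continue; // skip chunks that would overflow
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Calling fitToBudget(hits, 3000) before building the context string would hold retrieved text to roughly 3,000 estimated tokens, leaving the rest of the window for the system prompt and history. Swap in a real tokenizer if you need exact counts.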

Ship small, measure, iterate

The teams we have seen succeed with RAG all did the same thing. They shipped a thin version to a small group, measured retrieval and answer quality every week, and tightened one knob at a time. The ones that stalled spent six weeks on a generic framework before any real users touched it.

If you are planning to build a RAG feature and want a second pair of eyes on the architecture, cost model, or evaluation strategy, reach out.
