Cutting OpenAI API costs: caching, batching, and model routing in production

At a certain scale, the OpenAI bill is the bill. We run AI features for several clients and for our own product, Quicktalog, and have spent enough on tokens to care about every one of them. This post covers the three techniques that have reliably cut our OpenAI costs by 40% to 70% in production without degrading output quality. None of them are exotic. All of them require some work.

Know your baseline before you touch anything

Before any cost work, measure. You need to know, per feature:

Tokens in and tokens out per call.
Model used.
Call frequency.
Cost per 1,000 units of whatever business outcome matters: catalogs generated, support tickets resolved, pages summarized.

We log every OpenAI call to Postgres with those fields. A simple dashboard answers the question "what did AI cost us this week, and which feature drove it." Without this, every cost cut is a guess.
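A minimal sketch of the per-call cost calculation and per-feature rollup such a log makes possible. The `CallLog` shape, the field names, and the price table are assumptions for illustration, not the schema described above, and prices should come from your provider's current price sheet.

```typescript
// Hypothetical shape of one logged call; field names are illustrative.
interface CallLog {
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// USD per one million tokens. Values are supplied by the caller.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of a single call, given a price table keyed by model name.
function callCostUsd(log: CallLog, prices: Record<string, Price>): number {
  const price = prices[log.model];
  if (!price) throw new Error(`No price configured for model ${log.model}`);
  return (
    (log.inputTokens * price.inputPerMTok +
      log.outputTokens * price.outputPerMTok) /
    1_000_000
  );
}

// Total cost per feature across a set of logged calls.
function costByFeature(
  logs: CallLog[],
  prices: Record<string, Price>
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    const previous = totals.get(log.feature) ?? 0;
    totals.set(log.feature, previous + callCostUsd(log, prices));
  }
  return totals;
}

// Cost per 1,000 business outcomes (catalogs, tickets, pages, ...).
function costPerThousandOutcomes(totalCostUsd: number, outcomes: number): number {
  return outcomes === 0 ? 0 : (totalCostUsd / outcomes) * 1000;
}
```

Keeping the price table outside the log rows means a price change does not require rewriting history, only recomputing the rollup.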

Technique 1: prompt caching

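Anthropic and OpenAI both offer prompt caching, which bills repeated prompt prefixes at a reduced input rate. OpenAI applies it automatically to sufficiently long prompts; Anthropic requires you to mark the cacheable prefix with cache_control. The mental model: if 80% of your prompt is a long system prompt plus few-shot examples plus reference context that does not change between calls, cache that part. The provider reads it from cache, and you pay a fraction of the normal input token rate for it.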

The rules for a good cache hit:

The cached portion has to be identical across calls, byte for byte.
It has to sit at the start of the prompt, ahead of anything that varies between calls, because caches match on the prefix.
It has to meet the provider's minimum length.

Practical pattern:

lib/structure.ts

typescript

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

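// LONG_STATIC_INSTRUCTIONS and FEW_SHOT_EXAMPLES are string constants defined
// elsewhere (not shown here). They must not change between calls.
export async function structureMenu(ocrText: string) {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        // Static prefix: identical on every call, so it can be served from cache.
        text: LONG_STATIC_INSTRUCTIONS + FEW_SHOT_EXAMPLES,
        // Marks the end of the cacheable prefix.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      // Per-request content goes after the cached prefix.
      { role: "user", content: ocrText },
    ],
  });
  return response.content;
}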

For Quicktalog's catalog structuring prompt, prompt caching dropped input token cost by about 65% once we restructured the prompt so the static portion came first. We keep the cached chunk stable across releases and only bust it intentionally when the prompt changes.
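One way to check that the cache is actually being hit is to look at the usage block on each response. The sketch below assumes the Anthropic Messages API usage fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, and assumes that `input_tokens` counts only the uncached part of the prompt; the `CacheUsage` type and `cacheReadShare` helper are hypothetical names, not part of the SDK.

```typescript
// Subset of the usage object returned with a response (assumed field names).
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Fraction of all input tokens that were served from cache on this call.
// Returns a value between 0 (nothing cached) and 1 (everything cached).
function cacheReadShare(usage: CacheUsage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```

Logging this share per feature makes an accidental cache bust visible: the share drops toward zero as soon as something in the static prefix changes.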

Technique 2: the Batch API for non-realtime work

If a feature does not need a response in the next five seconds, run it through the Batch API. OpenAI and Anthropic both discount batch jobs by roughly 50%. The tradeoff is latency. Results arrive within the batch completion window (24 hours in the example below), often sooner, but never within seconds.

Good candidates:

Generating product descriptions for a new catalog.
Summarizing a day's worth of support tickets overnight.
Re-running a new prompt against a backtest set.
Any cron-driven LLM job.

Bad candidates:

Anything a user is actively waiting on.
Chat messages.
Anything with an SLA under an hour.

The implementation is boring. Upload a JSONL of requests, poll for completion, parse the JSONL of responses. We wrap it in a small TypeScript client with retry and partial-completion handling.

lib/batch.ts

typescript

import OpenAI from "openai";

const openai = new OpenAI();

export async function submitBatch(tasks: { id: string; prompt: string }[]) {
  const jsonl = tasks
    .map((t) =>
      JSON.stringify({
        custom_id: t.id,
        method: "POST",
        url: "/v1/chat/completions",
        body: {
          model: "gpt-4o-mini",
          messages: [{ role: "user", content: t.prompt }],
        },
      })
    )
    .join("\n");

  // File is a global in Node 20+; on older runtimes, the SDK's toFile helper is an alternative.
  const file = await openai.files.create({
    file: new File([jsonl], "batch.jsonl"),
    purpose: "batch",
  });

  const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  return batch.id;
}

For any feature that can tolerate the latency, halving the cost is a big deal.
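The parsing and partial-completion step can be sketched as below. It assumes each output line carries `custom_id`, a `response` with `status_code` and a chat-completion `body`, and an `error` field, which is the documented shape at the time of writing; `parseBatchOutput` and its return type are hypothetical names. Any expected id without a usable answer, including ids missing from the output entirely, is returned for resubmission.

```typescript
// Assumed shape of one line in the batch output file.
interface BatchOutputLine {
  custom_id: string;
  response?: {
    status_code?: number;
    body?: { choices?: { message?: { content?: string | null } }[] };
  } | null;
  error?: unknown;
}

interface ParsedBatch {
  succeeded: Map<string, string>; // custom_id -> model output
  retryIds: string[];             // ids to resubmit in a follow-up batch
}

function parseBatchOutput(jsonl: string, expectedIds: string[]): ParsedBatch {
  const succeeded = new Map<string, string>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate trailing newlines
    const row = JSON.parse(line) as BatchOutputLine;
    const content = row.response?.body?.choices?.[0]?.message?.content;
    const ok =
      !row.error &&
      row.response?.status_code === 200 &&
      typeof content === "string";
    if (ok) succeeded.set(row.custom_id, content as string);
  }
  // Anything we asked for but did not get back cleanly is a retry candidate.
  const retryIds = expectedIds.filter((id) => !succeeded.has(id));
  return { succeeded, retryIds };
}
```

Driving retries from the list of expected ids rather than from the error rows alone also covers requests that expired before the provider processed them.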

Technique 3: model routing

The most expensive mistake we see is sending every request to the biggest model "to be safe." Most tasks do not need the biggest model. Matching the model to the task is where the biggest wins live.

The routing pattern:

A cheap classifier (a regex, a small model, or a first pass) decides difficulty.
Easy requests go to gpt-4o-mini or Haiku.
Medium requests go to gpt-4o or Sonnet.
Hard requests, or cases where the small model returns low confidence, escalate to the top model.

lib/router.ts

typescript

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

type Difficulty = "easy" | "medium" | "hard";

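// Stand-in classifier: input length as a crude proxy for difficulty.
function classify(input: string): Difficulty {
  if (input.length < 400) return "easy";
  if (input.length < 2000) return "medium";
  return "hard";
}

// Note: "hard" maps to the same model as "medium" in this listing.
// Point it at your top-tier model to get three distinct tiers.
const MODEL_FOR: Record<Difficulty, string> = {
  easy: "gpt-4o-mini",
  medium: "gpt-4o",
  hard: "gpt-4o",
};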

export async function route(input: string) {
  const difficulty = classify(input);
  const result = await generateText({
    model: openai(MODEL_FOR[difficulty]),
    prompt: input,
  });
  return result.text;
}

For Quicktalog, the OCR-to-structured-catalog step runs through this router. About 80% of inputs are handled by the small model, 15% go to the medium model, and the remaining 5%, the hardest menus with unusual layouts, go to the big model. Average cost per catalog dropped by 55% with no drop in output quality on our eval set.
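The escalation rule from the routing pattern, where a low-confidence answer from a smaller model is retried on a larger one, is not shown in the listing above. A minimal sketch follows, with the model call injected so it stays provider-agnostic. The `Tier` names, the `confidence` score, and the 0.7 threshold are all assumptions; how confidence is obtained (a self-rating, a validator, log probabilities) is left open.

```typescript
type Tier = "small" | "medium" | "large";

// Hypothetical result shape: confidence is a score in [0, 1].
interface TierResult {
  text: string;
  confidence: number;
}

// Injected model call, so the routing logic can be tested without a provider.
type CallModel = (tier: Tier, input: string) => Promise<TierResult>;

const TIER_ORDER: Tier[] = ["small", "medium", "large"];

// Start at the tier the classifier picked and climb until an answer clears
// the confidence threshold or the largest tier has answered.
async function routeWithEscalation(
  input: string,
  startTier: Tier,
  callModel: CallModel,
  minConfidence = 0.7
): Promise<{ tier: Tier; text: string }> {
  let last: { tier: Tier; text: string } = { tier: startTier, text: "" };
  for (const tier of TIER_ORDER.slice(TIER_ORDER.indexOf(startTier))) {
    const result = await callModel(tier, input);
    last = { tier, text: result.text };
    if (result.confidence >= minConfidence) break;
  }
  return last;
}
```

Each escalation pays for the cheaper call as well as the expensive one, so this only saves money when most requests stop at the first tier.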

Evaluation is how you know it worked

Cost cuts that degrade quality silently are not a win. Every cost change we make runs against the same golden eval set before it ships:

A fixed set of 50 to 200 known tasks.
Grading with a second LLM call on a strict rubric, plus human spot checks.
Quality, latency, and cost compared side by side.

If a change cuts cost by 40% but drops quality score by 10%, we decide consciously whether that tradeoff is acceptable. Usually it is not.
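The side-by-side comparison can be reduced to a small check that runs after each eval. This is a sketch under assumptions: the `EvalSummary` fields and the thresholds are hypothetical, and a failing check is a prompt for a conscious decision rather than an automatic verdict.

```typescript
// Aggregate results of one run over the golden eval set (assumed fields).
interface EvalSummary {
  qualityScore: number; // mean rubric score, higher is better
  costUsd: number;      // total cost of the run
  p95LatencyMs: number;
}

// Allowed regressions, expressed as fractions (0.02 = 2%).
interface Thresholds {
  maxQualityDrop: number;
  maxLatencyIncrease: number;
}

function relativeChange(before: number, after: number): number {
  return before === 0 ? 0 : (after - before) / before;
}

// Relative change of the candidate against the baseline on each axis.
function compareRuns(baseline: EvalSummary, candidate: EvalSummary) {
  return {
    quality: relativeChange(baseline.qualityScore, candidate.qualityScore),
    cost: relativeChange(baseline.costUsd, candidate.costUsd),
    latency: relativeChange(baseline.p95LatencyMs, candidate.p95LatencyMs),
  };
}

// True only if cost went down and neither quality nor latency regressed
// beyond the allowed thresholds.
function passesCostGate(
  baseline: EvalSummary,
  candidate: EvalSummary,
  limits: Thresholds
): boolean {
  const delta = compareRuns(baseline, candidate);
  return (
    delta.cost < 0 &&
    delta.quality >= -limits.maxQualityDrop &&
    delta.latency <= limits.maxLatencyIncrease
  );
}
```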

Structured outputs reduce retries

One hidden cost is retries. If the model returns malformed JSON, you re-call, and that second call is full price. Structured outputs, where the API enforces a schema, drive the retry rate to near zero.

lib/extract.ts

typescript

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const schema = z.object({
  items: z.array(
    z.object({
      name: z.string(),
      price: z.number(),
      description: z.string().optional(),
      category: z.string().optional(),
    })
  ),
});

export async function extract(ocrText: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema,
    prompt: ocrText,
  });
  return object;
}

A single failed-and-retried call doubles your cost for that request. Cutting that rate from 5% to 0.5% is a meaningful saving on high-volume features.
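To put a number on that saving, here is a small model of retry cost. It assumes failures are independent, every failed call is retried until one succeeds, and a retry costs the same as the first attempt; real retry policies with caps or backoff will differ slightly.

```typescript
// Expected number of billed calls per request at a given failure rate,
// when each failure is retried until success: 1 / (1 - p).
function expectedCallsPerRequest(failureRate: number): number {
  if (failureRate < 0 || failureRate >= 1) {
    throw new RangeError("failureRate must be in [0, 1)");
  }
  return 1 / (1 - failureRate);
}

// Fraction of spend saved by lowering the failure rate from `before` to `after`.
function savingFromFewerRetries(before: number, after: number): number {
  return 1 - expectedCallsPerRequest(after) / expectedCallsPerRequest(before);
}
```

Under these assumptions, going from a 5% to a 0.5% failure rate saves roughly 4.5% of that feature's spend: small per call, noticeable at volume.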

Context budgets matter more than you think

Long prompts are expensive. RAG contexts especially. Six chunks of 800 tokens each, a system prompt, and the user message easily hit 6,000 input tokens. At gpt-4o pricing that is roughly $0.015 per call. Multiplied by 10,000 daily calls, that is about $150 a day, or roughly $4,500 a month, before any output tokens.

Cheap ways to cut context:

Fewer, shorter chunks. Tune your retriever.
Summarize conversation history after N turns instead of carrying it verbatim.
Strip HTML, markdown, and whitespace from context before sending.
Reuse a cached summary where you can.
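Two of the items above, whitespace stripping and fewer chunks, can be sketched as a context budget. The four-characters-per-token estimate is a rough heuristic for English text, not a tokenizer, and the function names are hypothetical. Chunks are assumed to arrive sorted by relevance, best first.

```typescript
// Rough token estimate: about 4 characters per token (an assumption;
// use a real tokenizer when accuracy matters).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Collapse runs of spaces/tabs and blank lines before sending context.
function compactWhitespace(text: string): string {
  return text.replace(/[ \t]+/g, " ").replace(/\s*\n\s*/g, "\n").trim();
}

// Keep chunks in rank order until the token budget is used up.
function fitToBudget(chunks: string[], budgetTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cleaned = compactWhitespace(chunk);
    const cost = estimateTokens(cleaned);
    if (used + cost > budgetTokens) break;
    kept.push(cleaned);
    used += cost;
  }
  return kept;
}

// Daily input spend for a given prompt size, volume, and price per million tokens.
function dailyInputCostUsd(
  tokensPerCall: number,
  callsPerDay: number,
  usdPerMTok: number
): number {
  return (tokensPerCall * callsPerDay * usdPerMTok) / 1_000_000;
}
```

For example, `dailyInputCostUsd(6000, 10000, 2.5)` reproduces the 6,000-token scenario above, using the $2.50 per million input tokens implied by the $0.015 per-call figure.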

Fallbacks: cheaper model when the big one fails

For resilience, we wire a small model as the fallback when the big one times out or rate-limits. A degraded answer beats a failed one: the user still sees an output. The product decides when a fallback answer is acceptable, and when the feature should just fail cleanly.
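A minimal sketch of that fallback wiring, with both model calls injected as plain functions. The names and the `degraded` flag are assumptions; the flag is there so the product layer can decide whether to show the fallback answer or fail cleanly. Note that the timeout only stops waiting: it does not cancel the in-flight primary request.

```typescript
type Generate = (input: string) => Promise<string>;

// Reject if the promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Try the primary model first; on timeout or any error (rate limits
// included), answer with the fallback model and mark the result degraded.
async function generateWithFallback(
  input: string,
  primary: Generate,
  fallback: Generate,
  timeoutMs: number
): Promise<{ text: string; degraded: boolean }> {
  try {
    const text = await withTimeout(primary(input), timeoutMs);
    return { text, degraded: false };
  } catch {
    const text = await fallback(input);
    return { text, degraded: true };
  }
}
```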

Watch the provider bill weekly

Cost surprises show up in the bill two weeks late. Weekly rollups per feature catch them earlier. We pipe our OpenAI usage logs into the same dashboard as our Stripe revenue, so cost per customer is one chart, not a spreadsheet.

What we would do differently

If we were starting Quicktalog again, the thing we would set up on day one is the logging pipeline. Every other improvement is easier once you can see what is happening. Caching and routing without measurement is a coin flip. With measurement, it is engineering.

If you are running an AI feature at scale and want a second opinion on the cost structure or a model routing design, we would be glad to help.
