Building production AI agents with tool use: patterns that actually work
Tool selection, retry and timeout strategy, evaluation, and the guardrails that keep an agent from falling over in production. A practical architecture built from real client work.

An agent is a chatbot that can do things. The model reasons about a task, picks a tool to call, reads the result, and decides what to do next. Most people building agents hit the same three walls: the agent runs forever, the tools fail in confusing ways, and nobody can tell why the answer was wrong. This post describes how we build agents for clients today, with the guardrails that keep them from falling over in production.
When an agent is the right tool
The first question, before any code, is whether you actually need an agent. A lot of agent projects we review would be better off as a fixed pipeline.
Use an agent when the user's goal genuinely branches. Booking a flight involves many paths depending on availability and preferences. Answering a support ticket might need a knowledge lookup, or a database query, or an escalation. The agent picks the path.
Do not use an agent when the workflow is fixed. If you always call the same three functions in the same order, write three function calls. An LLM in the middle is a flaky, expensive coordinator.
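For contrast, here is roughly what a fixed workflow looks like with no model in the middle. This is a minimal sketch: the three steps (`fetchOrder`, `computeRefund`, `formatReply`) and the refund rule are hypothetical stand-ins for whatever your real workflow calls.

```typescript
// Hypothetical fixed workflow: the same three steps, always in the same order.
type Order = { id: string; total: number; shipped: boolean };

// Stand-in for a database lookup (assumption: real code would query a store).
function fetchOrder(orderId: string): Order {
  return { id: orderId, total: 40, shipped: false };
}

// Made-up business rule: unshipped orders refund in full, shipped ones at 50%.
function computeRefund(order: Order): number {
  return order.shipped ? order.total * 0.5 : order.total;
}

function formatReply(order: Order, refund: number): string {
  return `Order ${order.id}: refund of ${refund.toFixed(2)} approved.`;
}

// The control flow is known ahead of time, so plain function calls are enough.
function handleRefund(orderId: string): string {
  const order = fetchOrder(orderId);
  const refund = computeRefund(order);
  return formatReply(order, refund);
}
```

Nothing here can loop, pick the wrong step, or burn tokens, which is the point.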
The tool-calling loop
Every modern agent runs a short loop.
- The model reads the conversation plus the set of available tools.
- It emits either a final answer or a tool call.
- Your code runs the tool and feeds the result back as a new message.
- The loop repeats until a final answer or a stop condition fires.
This is all that the Vercel AI SDK's generateText with tools does under the hood. You can call the provider SDK directly if you want more control.
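Written out by hand, the four steps above can be sketched as follows. The `Model` and `Tool` types are assumptions made for illustration, not any provider's real API: a real provider SDK has its own message and tool-call shapes.

```typescript
// A message in the conversation; "tool" messages carry results back to the model.
type Message = { role: "user" | "assistant" | "tool"; content: string };

// What the model can emit on each turn: a final answer or a tool call.
type ModelTurn =
  | { type: "final"; text: string }
  | { type: "tool_call"; name: string; args: Record<string, unknown> };

// Hypothetical interfaces standing in for a provider SDK and your tool functions.
type Model = (messages: Message[]) => Promise<ModelTurn>;
type Tool = (args: Record<string, unknown>) => Promise<unknown>;

async function runLoop(
  model: Model,
  tools: Record<string, Tool>,
  userMessage: string,
  maxSteps = 10,
): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userMessage }];

  for (let step = 0; step < maxSteps; step++) {
    // The model reads the conversation so far.
    const turn = await model(messages);

    // A final answer ends the loop.
    if (turn.type === "final") return turn.text;

    // Otherwise run the requested tool; unknown names become a diagnostic result.
    const tool = tools[turn.name];
    const result = tool
      ? await tool(turn.args)
      : { error: `Unknown tool: ${turn.name}` };

    // Feed the call and its result back as new messages.
    messages.push({ role: "assistant", content: JSON.stringify(turn) });
    messages.push({ role: "tool", content: JSON.stringify(result) });
  }

  // Stop condition: the step limit fired before a final answer.
  throw new Error(`Stopped after ${maxSteps} steps without a final answer`);
}
```

With the AI SDK, the same loop collapses into a single call: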
```ts
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
// Assumption: `db` is your app's Prisma-style client; the import path is hypothetical.
import { db } from "./db";

// Option names (`parameters`, `maxSteps`) follow the SDK version this was written
// against; other major versions may name them differently, so check the docs.
const lookupOrder = tool({
  description: "Look up an order by its id",
  parameters: z.object({ orderId: z.string() }),
  execute: async ({ orderId }) => {
    const order = await db.orders.findUnique({ where: { id: orderId } });
    // Return a diagnostic the model can act on instead of throwing.
    if (!order) return { error: "Order not found" };
    return { id: order.id, status: order.status, total: order.total };
  },
});

export async function runAgent(userMessage: string) {
  const result = await generateText({
    model: openai("gpt-4o"),
    tools: { lookupOrder },
    maxSteps: 10, // hard cap on loop iterations
    system:
      "You are a support agent. Use tools to answer accurately. Cite order ids in your reply.",
    prompt: userMessage,
  });
  return result.text;
}
```
Tools matter more than prompts
The biggest quality decision in an agent is the tool itself, not the prompt. Three rules we follow.
- Names and descriptions are read by the model. Write them like good function docs. A description of "gets a thing" tells the model nothing.
- The parameter schema is a contract. Use zod or a JSON Schema with strict mode so the model cannot invent fields. Strict schemas catch hallucinated arguments before they reach your code.
- Error messages should be diagnostic. If the tool fails, return a message the model can use to retry. "Invalid date format, expected YYYY-MM-DD" beats "500 error" every time.
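To make the second and third rules concrete, here is a dependency-free sketch of a strict argument check that returns diagnostic errors. The field names (`orderId`, `since`) and the `ToolResult` shape are assumptions for illustration. In a real project a zod object schema with `.strict()` would likely do the unknown-field check for you.

```typescript
// What a tool hands back to the model: data on success, a diagnostic on failure.
type ToolResult<T> = { ok: true; data: T } | { ok: false; error: string };

// Hypothetical tool input: an order id plus an optional ISO date.
type LookupArgs = { orderId: string; since?: string };

const ALLOWED_FIELDS = ["orderId", "since"];

function validateArgs(args: Record<string, unknown>): ToolResult<LookupArgs> {
  // Strictness: reject any field the schema does not declare.
  const extra = Object.keys(args).filter((k) => !ALLOWED_FIELDS.includes(k));
  if (extra.length > 0) {
    return {
      ok: false,
      error: `Unknown field(s): ${extra.join(", ")}. Allowed: ${ALLOWED_FIELDS.join(", ")}`,
    };
  }

  const { orderId, since } = args;
  if (typeof orderId !== "string" || orderId.length === 0) {
    return { ok: false, error: "orderId is required and must be a non-empty string" };
  }

  if (since !== undefined) {
    // Tell the model exactly which format is expected so it can retry.
    if (typeof since !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(since)) {
      return { ok: false, error: "Invalid date format for since, expected YYYY-MM-DD" };
    }
    return { ok: true, data: { orderId, since } };
  }

  return { ok: true, data: { orderId } };
}
```

Each error names the offending field and the expected shape, which gives the model something specific to change on its next attempt.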
Termination conditions
An unbounded agent is a bug. We wire three limits into every agent.
- Max step count. Usually 10 or 12. If the agent has not finished by then, return a failure and surface the trace.
- Total token budget. Cheap models make this easier to ignore, but a stuck agent still burns money fast.
- Wall clock budget. 60 seconds for user-facing agents, longer for background jobs.
Hitting any limit should produce a structured error the app can handle, not just a timeout.
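One way to wire the three limits together is a small budget object that the loop consults before each step. This is a sketch under assumptions: the `Limits` and `LimitError` shapes and the `createBudget` helper are invented for illustration, and the clock is injectable only so the behavior can be tested.

```typescript
type Limits = { maxSteps: number; maxTokens: number; maxWallClockMs: number };

// Structured error the app can branch on, rather than a bare timeout.
type LimitError = {
  kind: "limit_exceeded";
  limit: "steps" | "tokens" | "wall_clock";
  used: number;
  allowed: number;
};

function createBudget(limits: Limits, now: () => number = Date.now) {
  const startedAt = now();
  let steps = 0;
  let tokens = 0;

  return {
    // Call once after each model step with the tokens that step consumed.
    record(tokensUsed: number): void {
      steps += 1;
      tokens += tokensUsed;
    },

    // Call before starting a step: returns null if the run may continue.
    check(): LimitError | null {
      const elapsedMs = now() - startedAt;
      if (steps >= limits.maxSteps) {
        return { kind: "limit_exceeded", limit: "steps", used: steps, allowed: limits.maxSteps };
      }
      if (tokens >= limits.maxTokens) {
        return { kind: "limit_exceeded", limit: "tokens", used: tokens, allowed: limits.maxTokens };
      }
      if (elapsedMs >= limits.maxWallClockMs) {
        return {
          kind: "limit_exceeded",
          limit: "wall_clock",
          used: elapsedMs,
          allowed: limits.maxWallClockMs,
        };
      }
      return null;
    },
  };
}
```

The caller can then return the `LimitError` alongside the trace, so a stalled run shows up as a handled failure instead of a hung request.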
Observability from day one
The single biggest difference between an agent that survives in production and one that gets ripped out is tracing. Every run needs to log:
- Every tool call with its arguments, result, and duration.
- Every model call with its prompt, completion, and token count.
- The final outcome or error.
We write our own traces to Postgres for small agents and use Langfuse or LangSmith when a client already has one set up. Either works. The point is that you can pull up a failing run three weeks later and see exactly what happened.
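As a rough illustration of the tool-call half of that list, a wrapper like the one below records arguments, result or error, and duration for every call. The `ToolCallTrace` shape and the `traced` helper are assumptions, not our actual schema. The sink here is any function; in practice it might be a Postgres insert or a tracing client.

```typescript
// One row per tool call. Field names are illustrative.
type ToolCallTrace = {
  runId: string;
  tool: string;
  args: unknown;
  result?: unknown;
  error?: string;
  durationMs: number;
};

// Where trace entries go (assumption: an in-memory array, a DB insert, etc.).
type TraceSink = (entry: ToolCallTrace) => void;

// Wraps a tool function so every call is logged, whether it succeeds or fails.
function traced<A, R>(
  runId: string,
  tool: string,
  fn: (args: A) => Promise<R>,
  sink: TraceSink,
): (args: A) => Promise<R> {
  return async (args: A) => {
    const start = Date.now();
    try {
      const result = await fn(args);
      sink({ runId, tool, args, result, durationMs: Date.now() - start });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      sink({ runId, tool, args, error: message, durationMs: Date.now() - start });
      throw err; // tracing must not swallow the failure
    }
  };
}
```

Keying every entry on a run id is what makes it possible to reassemble a single failing run later.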
Evaluation is the work
Prompt engineering is what junior developers do on an agent. Evaluation is what senior developers do.
Build a golden set of 30 to 100 representative tasks with known-good traces. Every change to the agent (a new tool, a different model, a prompt tweak) runs against this set. You are looking for regressions in tool selection, final answer quality, and step count.
We grade final answers with a second LLM call using a clear rubric. Human review on 10% of runs catches what the judge model misses.
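The mechanical part of that check, tool selection and step count, can be sketched as a plain comparison against the golden set. The `GoldenCase` and `RunSummary` shapes below are assumptions for illustration. Answer quality is left out because that is the judge model's job.

```typescript
// One golden task: the tools a known-good trace used and its step ceiling.
type GoldenCase = { id: string; input: string; expectedTools: string[]; maxSteps: number };

// What we extract from a fresh run of the same task.
type RunSummary = { toolsCalled: string[]; steps: number };

type Regression = { id: string; reason: string };

function compareToGolden(
  cases: GoldenCase[],
  runs: Record<string, RunSummary>,
): Regression[] {
  const regressions: Regression[] = [];

  for (const c of cases) {
    const run = runs[c.id];
    if (!run) {
      regressions.push({ id: c.id, reason: "no run recorded" });
      continue;
    }

    // Tool selection: same tools in the same order as the golden trace.
    const expected = c.expectedTools.join(", ");
    const actual = run.toolsCalled.join(", ");
    if (expected !== actual) {
      regressions.push({
        id: c.id,
        reason: `tool selection changed: expected [${expected}], got [${actual}]`,
      });
    }

    // Step count: more steps than the golden trace is a cost and latency regression.
    if (run.steps > c.maxSteps) {
      regressions.push({
        id: c.id,
        reason: `step count ${run.steps} exceeds golden limit ${c.maxSteps}`,
      });
    }
  }

  return regressions;
}
```

An exact order match is deliberately strict. If your agent can reach a good answer by more than one path, loosen the comparison to a set check.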
What usually goes wrong
The failure modes we see most often, in rough order:
- The agent calls the same tool in a loop because the tool's error message is not specific enough to change its plan.
- Tool parameter drift. The model passes an id it invented. Strict schemas catch this.
- Latency. Every step is a round trip to the model. Four steps at two seconds each is a user waiting eight seconds with nothing to look at.
- The prompt keeps growing. Every tool result is appended to the conversation, so the context fills up faster than people expect.
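The first failure mode in that list is cheap to detect at runtime. Below is a minimal sketch, assuming you keep a per-run history of tool calls. The `ToolCall` shape, the `isStuck` helper, and the threshold of three are all illustrative choices, not a standard.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };

// Returns true when the last `threshold` calls are the same tool with
// identical arguments, which usually means the agent is not changing its plan.
function isStuck(history: ToolCall[], threshold = 3): boolean {
  if (history.length < threshold) return false;

  // Note: JSON.stringify is key-order sensitive, so the same arguments
  // written in a different key order count as a different call.
  const key = (c: ToolCall) => `${c.name}:${JSON.stringify(c.args)}`;

  const recent = history.slice(-threshold);
  const first = key(recent[0]);
  return recent.every((c) => key(c) === first);
}
```

When this fires, stopping the run with a structured error and the trace is usually more useful than letting the agent spend its remaining steps on the same call.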
Where to go from here
If you are building your first agent, do it without a framework. The Vercel AI SDK plus a provider SDK gives you enough structure without hiding the loop. Frameworks like LangGraph are valuable later, once you have a reason to need their graph model.
If you want a second opinion on an agent design, evaluation strategy, or cost model, reach out.
