Agent cost control: patterns that actually work
Concrete patterns for keeping LLM spend predictable in a multi-tenant agent platform — budgets, per-step caps, model routing, and the hard cut-off that saves the weekend.
Every team that ships an agent platform eventually has the same bad weekend. A bad prompt, a runaway loop, a misconfigured retry — and suddenly the finance Slack channel is on fire. This guide is the set of patterns we recommend to make sure your team never has that weekend.
Pattern 1: per-tenant hard budgets
The baseline. Every tenant gets a daily budget. When they hit it, new requests return 402 Payment Required and in-flight requests terminate at the next step boundary.
from fastagentic import App  # assumed import path
from fastagentic.cost import Budget, RedisCostTracker

app = App(
    cost_tracker=RedisCostTracker(
        redis_url="redis://localhost",
        budget_per_tenant=Budget(
            daily=100.0,   # $100/day
            hourly=20.0,   # $20/hour burst cap
            currency="USD",
        ),
    ),
)
Two knobs, both hard. The hourly cap catches bugs the daily cap would only catch after several hours.
Pattern 2: per-user budgets inside tenants
Tenant-level budgets are rarely enough. Inside a tenant, you usually want each user to have their own cap — otherwise one user’s runaway agent drains the whole tenant’s budget in an hour.
@agent_endpoint(
    "/research",
    policies=[
        RateLimitPolicy(per_user="50/hour"),
        BudgetPolicy(per_user="10/day"),
    ],
)
async def research(query: str) -> str: ...
Pattern 3: per-endpoint budgets
Some endpoints are inherently more expensive than others. Give the expensive ones tighter caps:
@agent_endpoint(
    "/deep-research",  # hits GPT-4o + 20 tool calls
    policies=[BudgetPolicy(per_user="5/day", per_call_max=2.0)],
)
async def deep_research(query: str) -> str: ...
per_call_max=2.0 is the safety net: if a single call is projected to exceed $2, it halts. This catches runaway loops before they finish.
Pattern 4: step-level budget checks
In long LangGraph pipelines, you want to check the budget between steps, not just at the end. FastAgentic’s StepTracker does this automatically when a cost tracker is configured.
@agent_endpoint(
    "/pipeline",
    adapter=LangGraphAdapter(graph, track_steps=True),
    policies=[BudgetPolicy(per_call_max=5.0, check_per_step=True)],
)
async def pipeline(q: str) -> str: ...
Every node transition checks the cumulative cost. If the limit is hit, the run terminates cleanly (checkpoint saved, response streamed with a partial-failure marker) instead of charging forward.
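Stripped of the checkpointing and streaming machinery, a step-boundary check is just a cumulative total inspected after each node. This sketch is a stand-in for what a step tracker does, not FastAgentic's code; the step and exception names are illustrative.

```python
class BudgetExceeded(Exception):
    pass


def run_pipeline(steps, per_call_max: float):
    """Run steps in order, checking cumulative cost at each boundary.

    Each step is a callable returning (result, cost). A real tracker
    would also save a checkpoint and emit a partial-failure marker.
    """
    total = 0.0
    results = []
    for step in steps:
        result, cost = step()
        total += cost
        results.append(result)
        if total > per_call_max:
            raise BudgetExceeded(
                f"${total:.2f} > ${per_call_max:.2f} after {len(results)} steps"
            )
    return results
```

The point is where the check sits: between steps, so a loop can burn at most one step's cost past the limit, never the whole run.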
Pattern 5: model routing by budget
Not every request needs the expensive model. Route requests to cheaper models when the user’s remaining budget is low:
from fastagentic.routing import budget_aware_router

router = budget_aware_router(
    default="anthropic:claude-sonnet-4-6",
    fallback_below={
        20.0: "anthropic:claude-haiku-4-5",
        5.0: "openai:gpt-4o-mini",
    },
)
agent = Agent(model=router, ...)
Users with lots of budget left get the good model. Users close to their cap get the cheap one. The degradation is graceful instead of cliff-edged.
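The routing logic reduces to a threshold walk. Here is a minimal sketch of how a `fallback_below` table could be interpreted — an assumption about the semantics, not the library's actual code: check thresholds from lowest to highest and take the first fallback the remaining budget falls under.

```python
def pick_model(remaining_budget: float,
               default: str,
               fallback_below: dict[float, str]) -> str:
    """Return the cheapest-tier model the remaining budget calls for.

    Thresholds are checked lowest-first so the tightest matching
    fallback wins; above all thresholds, the default model is used.
    """
    for threshold in sorted(fallback_below):
        if remaining_budget < threshold:
            return fallback_below[threshold]
    return default
```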
Pattern 6: pre-flight cost estimation
Before a run starts, you can estimate its cost from prompt size, expected tool calls, and historical data:
estimate = await cost_tracker.estimate(
    prompt=query,
    expected_steps=graph.max_depth,
    model=agent.model_name,
)
if estimate > user.remaining_budget:
    raise HTTPException(402, f"Estimated cost ${estimate} exceeds remaining budget")
This turns out-of-budget errors from late failures into early, cheap rejections.
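A serviceable estimator needs surprisingly little. The sketch below is one possible back-of-envelope formula, not the tracker's actual model: the 4-characters-per-token ratio and the per-step token average are assumptions you would replace with your own historical data.

```python
def estimate_cost(prompt: str,
                  expected_steps: int,
                  price_per_1k_tokens: float,
                  avg_tokens_per_step: int = 1500) -> float:
    """Rough pre-flight cost estimate in dollars.

    prompt tokens (approximated at ~4 chars/token) plus a historical
    average of tokens consumed per step, times the model's price.
    """
    prompt_tokens = len(prompt) / 4  # crude token approximation
    total_tokens = prompt_tokens + expected_steps * avg_tokens_per_step
    return total_tokens * price_per_1k_tokens / 1000
```

Even a 2x-off estimate is useful here: the goal is rejecting obviously unaffordable runs before they start, not billing-grade accuracy.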
Pattern 7: cost attribution dashboards
You can’t control what you can’t see. FastAgentic emits cost events to your observability stack (Langfuse, Datadog, OTel) with tags for user, tenant, endpoint, model, and run ID. The first dashboard we build for every client has four charts:
- Cost per tenant, last 24 hours.
- Cost per endpoint, last 7 days.
- Runaway detector: runs where cost > 3 standard deviations above median.
- Model mix: % of spend on each model, trended weekly.
If you can’t see those four at a glance, you’re flying blind.
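The runaway detector from the third chart is one query over per-run costs. A minimal sketch of the statistic, assuming you can pull a list of recent run costs from your observability stack:

```python
import statistics


def runaway_runs(costs: list[float], sigma: float = 3.0) -> list[int]:
    """Return indices of runs whose cost exceeds the median
    by more than `sigma` standard deviations."""
    median = statistics.median(costs)
    stdev = statistics.stdev(costs)
    return [i for i, c in enumerate(costs) if c > median + sigma * stdev]
```

Median rather than mean as the baseline matters: a single runaway drags the mean toward itself and can hide behind it.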
Pattern 8: kill switches
Sometimes the right answer is “turn the whole endpoint off.” FastAgentic has a feature-flag integration that lets you disable an endpoint, a tenant, or a user in Redis without a redeploy:
fastagentic policy disable-endpoint /deep-research --reason "investigating cost spike"
In-flight requests finish; new ones get 503. Turn it back on when the issue is understood.
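Under the hood, a kill switch is a flag lookup before the request is served. This sketch uses a plain dict where production would use Redis, and the key scheme (`endpoint:`, `tenant:`, `user:`) is illustrative, not FastAgentic's actual format.

```python
def check_kill_switch(flags: dict, endpoint: str, tenant: str, user: str):
    """Consult a shared flag store before serving a request.

    Any matching key means the request is refused with 503; the
    stored value is the human-readable reason for the ops log.
    """
    for key in (f"endpoint:{endpoint}", f"tenant:{tenant}", f"user:{user}"):
        if key in flags:
            return 503, flags[key]
    return 200, None
```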
The anti-patterns
Things we see teams do that don’t work:
- Soft warnings without hard caps. “We’ll just send a Slack alert.” The alert arrives an hour after the damage.
- Cost caps at the LLM provider dashboard. Providers rate-limit unpredictably and don’t tell your app until the 429 comes back.
- Per-request retries with exponential backoff and no budget check. Each retry compounds cost. Always check budget before retry.
- Cost tracking only at the end of a run. Runaway loops can spend the entire budget before the first result lands.
- Shared budgets across dev and prod. Separate them. Dev will blow up; prod should not.
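The retry anti-pattern has a mechanical fix: put the budget check inside the retry loop, before every attempt. A generic sketch, with illustrative names:

```python
import time


class BudgetExhausted(Exception):
    pass


def retry_with_budget(call, remaining_budget: float,
                      est_cost_per_attempt: float,
                      max_attempts: int = 3, base_delay: float = 0.01):
    """Exponential-backoff retry that re-checks the budget each attempt."""
    spent = 0.0
    for attempt in range(max_attempts):
        if spent + est_cost_per_attempt > remaining_budget:
            raise BudgetExhausted(f"stopping after ${spent:.2f} spent")
        try:
            return call()
        except Exception:
            spent += est_cost_per_attempt
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("all attempts failed within budget")
```

Without the in-loop check, three retries against an expensive endpoint quietly quadruple the cost of every failure.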
Minimum viable cost control
If you only have two hours, implement this much:
- A per-tenant daily hard budget via RedisCostTracker.
- per_call_max on every endpoint.
- A Grafana (or Langfuse) panel showing cost-per-tenant for the last 24 hours.
That covers 90% of the incidents we see.
The longer game
Once the basics are in place, graduate to model routing, pre-flight estimation, and kill switches. They’re not critical on day one, but they’re the difference between a platform that survives its first viral customer and one that doesn’t.
Need FastAPI, LangGraph, or agent platform expertise?
Neul Labs — the team behind FastAgentic — takes on a limited number of consulting engagements each quarter. We help teams ship agents to production, fix broken LangGraph pipelines, and design governance for multi-tenant LLM platforms.