Part of our Performance & Scalability series
Optimizing AI Agent Costs: Token Usage and Caching
AI agent operational costs can scale from manageable to alarming surprisingly quickly. An agent processing 10 transactions per day is inexpensive. The same agent processing 5,000 transactions per day, with each transaction requiring 3-4 LLM calls with large context windows, can generate thousands of dollars in monthly API costs — costs that weren't in the original ROI model.
Cost optimization is not optional for production-scale AI deployments. It's the difference between an agent that delivers positive ROI and one that erodes it. This guide covers the practical strategies that reduce costs by 40-70% in typical OpenClaw deployments without compromising output quality.
Key Takeaways
- Token optimization (prompt compression, context pruning) reduces API costs 25-40% with no quality loss
- Semantic caching eliminates LLM calls for repeated or similar requests, reducing costs 30-60% in many workloads
- Model routing uses cheap models for simple tasks and expensive models only when needed
- Prompt caching (where available from providers) reduces input token costs for repetitive system prompts
- Batch processing reduces per-call overhead for high-volume, non-time-sensitive workloads
- Cost monitoring with per-workflow attribution identifies the most expensive agent behaviors
- Streaming reduces time-to-first-token latency for user-facing agents without increasing total cost
- A comprehensive cost optimization strategy typically reduces total LLM spend by 45-65% vs. unoptimized deployments
Understanding AI Agent Cost Drivers
Before optimizing costs, understand what drives them. LLM API costs are primarily based on token consumption:
Input tokens: Every token sent to the model costs money — the system prompt, the user message, the retrieved context (RAG chunks), conversation history, and any few-shot examples. For current frontier models, input tokens typically cost 2-5x less per token than output tokens.
Output tokens: Tokens generated by the model in its response. Verbose outputs cost more. Reasoning steps (chain-of-thought) cost more than direct answers. Structured JSON outputs cost more than prose if the JSON has many fields.
Call volume: Every LLM call has a minimum cost. Multi-step agents that make 5 LLM calls per task cost 5x more than single-call agents — but may produce much better results. The key is eliminating unnecessary calls.
Model selection: The cost difference between models is enormous. Claude 3 Haiku costs ~50x less than Claude 3 Opus per token. GPT-4o costs ~15x more than GPT-4o mini. Using a frontier model for every task is the most common source of unnecessary cost.
A realistic cost scenario:
Agent processes 1,000 customer service tickets per day. Each ticket requires:
- System prompt: 800 tokens
- Retrieved context: 1,200 tokens
- Ticket content: 400 tokens
- Total input: 2,400 tokens
- Response: 600 tokens
Using Claude 3.5 Sonnet ($3/M input, $15/M output):
- Daily cost: 1,000 × [(2,400 × $3/M) + (600 × $15/M)] = $16.20/day = $486/month
With optimization (shown in this guide), this drops to $150-$200/month — a 60% reduction.
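The scenario above is just per-ticket token arithmetic, and it helps to make the model explicit. A minimal sketch using the quoted Sonnet rates (the 30-day month is an assumption):

```python
# Cost model for the scenario above: 1,000 tickets/day, 2,400 input and
# 600 output tokens per ticket, at $3/M input and $15/M output.

INPUT_PRICE_PER_M = 3.00    # $ per million input tokens
OUTPUT_PRICE_PER_M = 15.00  # $ per million output tokens

def daily_cost(tickets_per_day: int, input_tokens: int, output_tokens: int) -> float:
    """Total daily LLM cost in dollars for a fixed per-ticket token profile."""
    per_ticket = (input_tokens * INPUT_PRICE_PER_M
                  + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return tickets_per_day * per_ticket

cost = daily_cost(1_000, input_tokens=2_400, output_tokens=600)
print(f"${cost:.2f}/day, ${cost * 30:.0f}/month")  # $16.20/day, $486/month
```

Plugging optimized token counts into the same function shows exactly where the savings in the rest of this guide come from.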
Prompt Compression and Token Reduction
System Prompt Optimization
System prompts are sent with every request. A bloated 2,000-token system prompt that could be compressed to 800 tokens without information loss costs 2.5x more than necessary in input tokens on every single call.
Techniques:
Remove redundancy: Review your system prompts for information that's restated in multiple places. Consolidate.
Use compressed language: Avoid conversational preamble. Compare:
Verbose (47 tokens): "You are a helpful assistant that is skilled at reviewing contracts. Your job is to carefully read through the contract and identify any clauses that might represent a risk to our company."
Compressed (23 tokens): "You are a contract risk analyst. Identify clauses representing risk to the client company."
The compressed version conveys identical instructions. LLMs respond to semantic content, not word count.
Use structured formatting: Numbered lists and bullet points convey information more densely than paragraphs.
Remove examples from system prompt when using few-shot: If you have examples in both the system prompt and the user message, you're paying for them twice. Consolidate to one location.
Audit system prompt length regularly: System prompts tend to grow as teams add instructions over time without removing outdated ones. A quarterly review typically finds 20-30% of system prompt content can be removed or compressed.
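As a rough illustration of the verbose-vs-compressed comparison above, here is a sketch using a crude ~4-characters-per-token heuristic. For real audits, use the provider's tokenizer (e.g. tiktoken for OpenAI models) rather than this approximation:

```python
def approx_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

verbose = ("You are a helpful assistant that is skilled at reviewing contracts. "
           "Your job is to carefully read through the contract and identify any "
           "clauses that might represent a risk to our company.")
compressed = ("You are a contract risk analyst. "
              "Identify clauses representing risk to the client company.")

saving = 1 - approx_tokens(compressed) / approx_tokens(verbose)
print(f"Estimated input-token saving: {saving:.0%}")
```

Running an estimate like this over every prompt in a quarterly audit quickly surfaces the bloated ones.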
Context Window Management
RAG (Retrieval Augmented Generation) retrievals are one of the largest cost drivers for knowledge-intensive agents. Each retrieved chunk is input tokens. Unoptimized RAG frequently retrieves more context than needed.
Chunk size optimization: Smaller chunks (256-512 tokens) retrieved in higher quantities often outperform large chunks (1,000+ tokens) for factual question answering. Smaller chunks are also cheaper: you avoid paying for the irrelevant passages that ride along inside a large chunk.
Retrieval count tuning: If your agent retrieves 10 chunks per query but consistently uses information from only the top 2-3, reduce the retrieval count. Monitor which retrieved chunks are actually referenced in agent outputs.
Relevance filtering: Apply a relevance score threshold — only include retrieved chunks above the threshold in the context. Chunks with low relevance add cost without improving quality.
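A minimal sketch of relevance filtering, assuming your retriever returns (text, score) pairs; the 0.75 threshold and the cap of 5 chunks are illustrative values to tune per workload:

```python
# Keep only retrieved chunks above a relevance threshold before building
# the prompt, so low-relevance chunks never consume input tokens.

def filter_chunks(chunks: list[tuple[str, float]], threshold: float = 0.75,
                  max_chunks: int = 5) -> list[str]:
    """Return the text of at most `max_chunks` chunks scoring >= threshold."""
    kept = [(text, score) for text, score in chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_chunks]]

retrieved = [("Return policy: 30 days...", 0.91),
             ("Shipping rates table...", 0.62),
             ("Warranty terms...", 0.80)]
print(filter_chunks(retrieved))  # drops the 0.62 chunk
```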
Conversation history pruning: For multi-turn agents, conversation history grows with each turn. Older turns are often less relevant. Implement a summarization strategy: after 8-10 turns, summarize early conversation into a compressed summary (200-300 tokens) rather than retaining full turn-by-turn history.
```python
def manage_conversation_history(messages: list, max_tokens: int = 2000) -> list:
    """Prune conversation history to stay within a token budget.

    Assumes `count_tokens` and `summarize_conversation` are implemented
    elsewhere (e.g. via the provider's tokenizer and a cheap summarization
    call).
    """
    if count_tokens(messages) <= max_tokens:
        return messages
    # Summarize everything between the system message and the last 6
    # messages (i.e. the 3 most recent user/assistant turns).
    early_messages = messages[1:-6]
    summary = summarize_conversation(early_messages)
    return [
        messages[0],  # system message
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]"},
        *messages[-6:],  # 3 most recent turns
    ]
```
Semantic Caching
Semantic caching is the highest-impact cost optimization for agents handling repetitive queries. It stores the result of LLM calls and returns cached results for subsequent requests that are semantically similar — even if not identical.
How Semantic Caching Works
- When an LLM call is made, compute an embedding vector for the input (prompt + context)
- Search the cache for stored results with high vector similarity to the current input
- If similarity exceeds the threshold, return the cached result (no LLM call)
- If not, make the LLM call and store the result with its embedding
The critical insight: many real-world requests are semantically similar even when not textually identical. "What is the return policy for orders placed in the last 30 days?" and "Can I return something I ordered 3 weeks ago?" are different words but the same question — semantic caching can serve the second from the cache of the first.
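The four steps above can be sketched as a minimal in-memory cache. The `embed` callable here is a stand-in for a real embedding model (e.g. a provider's embedding endpoint), and a production version would use a vector index rather than the linear scan below:

```python
import math

class SemanticCache:
    """Minimal in-memory semantic cache: embed, search, hit-or-store."""

    def __init__(self, embed, threshold: float = 0.93):
        self.embed = embed            # callable: str -> list[float]
        self.threshold = threshold    # minimum cosine similarity for a hit
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def lookup(self, prompt: str):
        """Return a cached response if a stored entry is similar enough."""
        vec = self.embed(prompt)
        best = max(self.entries, key=lambda e: self._cosine(vec, e[0]), default=None)
        if best is not None and self._cosine(vec, best[0]) >= self.threshold:
            return best[1]  # cache hit: no LLM call needed
        return None

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))
```

A caller checks `lookup` first and only invokes the LLM (then `store`s the result) on a miss; the 0.93 default mirrors the threshold guidance in the configuration section.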
Cache Hit Rate by Agent Type
| Agent Type | Expected Cache Hit Rate | Rationale |
|---|---|---|
| FAQ / customer support | 50-75% | Common questions repeat frequently |
| Data lookup (product info, pricing) | 40-65% | Same products queried repeatedly |
| Document classification | 30-50% | Similar document types appear repeatedly |
| Report narrative generation | 20-40% | Trends are similar across periods |
| Custom workflow orchestration | 5-15% | Each case is highly unique |
| Data analysis | 10-25% | Questions are varied but some repeat |
For a customer support agent with a 65% cache hit rate, semantic caching reduces LLM call volume — and therefore LLM cost — by 65%.
Cache Configuration
Similarity threshold: The threshold for declaring two requests "similar enough" for cache reuse. Higher threshold = fewer cache hits but higher accuracy. Lower threshold = more cache hits but risk of returning subtly wrong answers for dissimilar requests.
For factual queries, a similarity threshold of 0.92-0.95 is typically safe. For analytical or reasoning tasks, use a higher threshold (0.97+) to avoid returning incorrect analysis for subtly different questions.
Cache TTL: Different cache entry types should have different expiration periods:
- Product pricing: 1-4 hours (prices change)
- Policy information: 24-48 hours (policies rarely change)
- General knowledge: 7 days (very stable information)
- Generated reports: Cache until the underlying data changes (event-triggered invalidation)
Cache scope: Configure whether the cache is per-user, per-organization, or global. Customer support agents should have organization-scoped caches (an answer appropriate for your organization may not be appropriate for another). General knowledge agents can share a global cache.
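The TTL and scope guidance above can be captured in a small policy table. This is a sketch: the entry-type names are hypothetical and the TTL values simply mirror this guide's suggestions:

```python
from datetime import timedelta

# TTLs follow the suggestions above; `scope` controls cache-key namespacing.
# A ttl of None means event-triggered invalidation rather than expiry.
CACHE_POLICY = {
    "product_pricing":   {"ttl": timedelta(hours=4),  "scope": "organization"},
    "policy_info":       {"ttl": timedelta(hours=48), "scope": "organization"},
    "general_knowledge": {"ttl": timedelta(days=7),   "scope": "global"},
    "generated_report":  {"ttl": None,                "scope": "organization"},
}

def cache_key(entry_type: str, org_id: str, prompt_hash: str) -> str:
    """Namespace cache keys by scope so org-specific answers never leak."""
    scope = CACHE_POLICY[entry_type]["scope"]
    prefix = "global" if scope == "global" else f"org:{org_id}"
    return f"{prefix}:{entry_type}:{prompt_hash}"
```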
Model Routing and Tiered LLM Selection
Not every task requires a frontier model. Using GPT-4o or Claude 3.5 Sonnet for a simple classification task that GPT-4o mini handles correctly means paying 15-50x more than necessary.
Routing Strategy
Task complexity classification: Implement a lightweight classifier that categorizes each incoming request by complexity:
- Simple: Lookup, classification with few categories, short generation with clear template
- Moderate: Multi-step reasoning, extraction from complex documents, conditional logic
- Complex: Open-ended analysis, creative synthesis, nuanced judgment
Model assignment:
- Simple → GPT-4o mini, Claude 3 Haiku (cost: ~$0.15-0.30/M tokens)
- Moderate → Claude 3.5 Sonnet, GPT-4o (cost: ~$3-5/M tokens)
- Complex → Claude 3.5 Sonnet or GPT-4o, with o1 for deep reasoning tasks (cost: ~$5-15/M tokens)
Fallback routing: If the cheaper model produces output below quality threshold (detected by automated evaluation), retry with the more expensive model. This "cascade" approach uses cheap models optimistically and only escalates when needed.
```python
def route_to_model(task: AgentTask) -> str:
    """Map classified task complexity to a model tier (names illustrative)."""
    complexity = classify_task_complexity(task)
    model_map = {
        "simple": "claude-3-haiku",
        "moderate": "claude-3-5-sonnet",
        "complex": "claude-3-5-sonnet",
    }
    return model_map[complexity]

def execute_with_fallback(task: AgentTask):
    """Try the routed (cheaper) model first; escalate if quality is too low."""
    primary_model = route_to_model(task)
    result = execute_with_model(task, primary_model)
    if not meets_quality_threshold(result):
        # Escalate to a more capable model
        result = execute_with_model(task, "claude-3-5-sonnet")
    return result
```
Realistic savings from model routing: In a mixed-workload agent fleet, 60-70% of tasks typically qualify as "simple." Routing these to cheap models achieves 50-70% cost reduction on that segment, translating to 30-50% overall cost reduction.
Prompt Caching (Provider-Level)
Anthropic and OpenAI offer prompt caching features that reduce the cost of repeated system prompts. When the system prompt (or any prefix of the prompt) is identical across multiple requests, cached tokens cost significantly less than fresh tokens.
Anthropic cache pricing: Cached input tokens cost ~10% of standard input token price ($0.30/M vs. $3/M for Sonnet). Cache write cost is $3.75/M (written once, then read at $0.30/M).
Effective strategy: Structure prompts so the stable portion (system prompt, examples, instructions) comes first and the variable portion (user input, retrieved context) comes last. The provider caches the stable prefix automatically.
Break-even calculation: Cache write costs 1.25x standard input token price; cache read costs 0.1x. Break-even is at 2 requests that share the prefix. Every request beyond the second is 90% cheaper for the cached portion.
For an agent with a 1,000-token system prompt running 1,000 requests per day:
- Without caching: 1,000 requests × 1,000 tokens × $3/M = $3.00/day input cost for the system prompt alone
- With caching: one cache write (1,000 tokens × $3.75/M ≈ $0.004) + 999 cached reads (999 × 1,000 tokens × $0.30/M ≈ $0.30) ≈ $0.30/day
- Daily savings: ~$2.70, a 90% reduction on this component
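The break-even arithmetic can be checked in a few lines. Rates are the Sonnet prices quoted above; note the one-time write is priced per token, so for a 1,000-token prefix it costs fractions of a cent:

```python
# Cumulative cost of a cached 1,000-token prompt prefix vs. paying full
# price every time: $3/M fresh input, $3.75/M cache write (1.25x),
# $0.30/M cache read (0.1x).

def uncached_cost(requests: int, prefix_tokens: int = 1_000) -> float:
    return requests * prefix_tokens * 3.00 / 1e6

def cached_cost(requests: int, prefix_tokens: int = 1_000) -> float:
    write = prefix_tokens * 3.75 / 1e6                   # first request writes the cache
    reads = (requests - 1) * prefix_tokens * 0.30 / 1e6  # later requests read it
    return write + reads

for n in (1, 2, 1_000):
    print(n, round(uncached_cost(n), 5), round(cached_cost(n), 5))
# 1 request: caching costs slightly more; from 2 requests on, it wins
```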
Batch Processing
For non-time-sensitive workloads (overnight report generation, batch document processing, scheduled data analysis), batch API calls offer significant cost reductions.
OpenAI Batch API: 50% cost reduction for requests submitted as batches with 24-hour completion windows. For overnight report generation, this alone halves the LLM API cost.
Anthropic Message Batches: Similar batch pricing for non-time-sensitive workloads.
Batch scheduling patterns:
- Collect report generation requests throughout the day, submit as batch at end-of-business
- Process document ingestion for RAG during off-peak hours as batch jobs
- Run compliance monitoring scans nightly as batches
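For OpenAI's Batch API, the day's collected requests are written to a JSONL file and submitted as a single batch. A sketch of the file builder, where the request-line shape follows OpenAI's batch input format and the model name and `report-*` IDs are illustrative:

```python
import json

def build_batch_file(requests: list[dict], path: str) -> int:
    """Write collected requests as an OpenAI Batch API input file (JSONL).

    Each line is an independent request; `custom_id` is how you match
    results back to requests when the batch completes.
    """
    with open(path, "w") as f:
        for i, req in enumerate(requests):
            line = {
                "custom_id": f"report-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{"role": "user", "content": req["prompt"]}],
                },
            }
            f.write(json.dumps(line) + "\n")
    return len(requests)
```

The file is then uploaded and a batch created with a 24-hour completion window per OpenAI's batch workflow; results arrive as an output JSONL keyed by `custom_id`.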
Cost Monitoring and Attribution
Optimization requires knowing where costs are coming from. Implement cost monitoring from the first day of production:
Per-workflow cost tracking: Tag every LLM call with the workflow it belongs to. Calculate total cost per workflow per day. This reveals which agent behaviors are most expensive and prioritizes optimization effort.
Per-token attribution: Break down costs by input vs. output tokens, by prompt component (system prompt vs. context vs. user input), and by model. Cost attribution at this granularity enables targeted optimization.
Cost anomaly detection: Alert when daily costs spike more than 20% above the rolling 7-day average. Spikes indicate either legitimate volume increases (expected) or bugs (infinite loops, runaway context windows, prompt injection causing unusually long completions).
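The spike rule above is a one-liner to implement — a sketch, assuming you already log one cost total per day:

```python
def cost_spike(daily_costs: list[float], threshold: float = 0.20) -> bool:
    """Flag today's cost if it exceeds the rolling 7-day average by > threshold.

    `daily_costs` holds one total per day, oldest first; today is the last entry.
    """
    if len(daily_costs) < 8:
        return False  # not enough history for a 7-day baseline
    today = daily_costs[-1]
    baseline = sum(daily_costs[-8:-1]) / 7
    return today > baseline * (1 + threshold)

history = [16.2] * 7 + [22.0]  # today is ~36% above the 7-day average
print(cost_spike(history))  # True
```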
Cost per successful task: Divide total costs by successful task completions to get cost per unit of value. This is the metric that matters for ROI — if cost per task decreases while task volume and quality hold, optimization is working.
Frequently Asked Questions
How much can cost optimization realistically reduce LLM API costs?
In typical OpenClaw deployments, a systematic optimization effort addressing prompt compression, semantic caching, and model routing achieves 45-65% cost reduction compared to unoptimized deployments. The specific savings depend heavily on workload characteristics — agents with highly repetitive queries benefit most from caching; agents with diverse, unique queries benefit more from model routing.
Does semantic caching compromise response accuracy?
With proper threshold configuration, the accuracy impact is negligible — typically less than 0.5% degradation on factual tasks. The key is setting the similarity threshold appropriately for the task type. For tasks where subtle differences in the question lead to different correct answers, use higher similarity thresholds (0.96+) to ensure only truly equivalent queries are served from cache.
What is the latency impact of semantic caching?
Cache lookups (vector similarity search) add 5-15ms latency. Cache hits eliminate the LLM call latency (typically 500ms-3s). Net result: cached responses are 20-200x faster than non-cached responses. This is a latency improvement, not a degradation.
How do we implement cost monitoring without significant engineering effort?
OpenClaw's observability layer captures token counts and model selections for every execution automatically. ECOSIRE configures a cost dashboard during implementation that shows costs by workflow, model, and time period. No custom engineering is required — the monitoring infrastructure is part of the standard implementation.
At what scale do cost optimization measures become worthwhile?
Most optimization measures become worthwhile above $500/month in LLM API costs. Below that threshold, the engineering effort typically exceeds the savings. Above $2,000/month, systematic optimization is strongly recommended — the ROI on engineering time invested in optimization is very high at this scale.
Does switching to cheaper models compromise the quality of agent outputs?
For tasks where cheaper models genuinely provide equivalent quality, switching to them is pure savings. For tasks requiring deep reasoning, nuanced judgment, or complex synthesis, cheaper models produce noticeably worse outputs. The model routing pattern addresses this by using cheaper models only where they're appropriate and routing to premium models for tasks that require them. The key is empirical validation — test the cheaper model on your specific task before routing production traffic to it.
Next Steps
Cost optimization for AI agents is an ongoing discipline, not a one-time project. ECOSIRE's OpenClaw implementations include a cost optimization layer from day one — semantic caching, model routing, and prompt optimization are built into the deployment architecture rather than added as afterthoughts.
Explore ECOSIRE OpenClaw Services to discuss your cost optimization requirements, or review our maintenance and optimization retainer options to understand how ECOSIRE manages ongoing cost efficiency for production OpenClaw deployments.
Written by
ECOSIRE Research and Development Team
Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integrations, e-commerce automation, and AI-powered business solutions.