Optimizing AI Agent Costs: Token Usage and Caching

Practical strategies for reducing AI agent operational costs through token optimization, caching, model routing, and usage monitoring. Real savings from production OpenClaw deployments.

ECOSIRE Research and Development Team
March 19, 2026 · 11 min read · 2.5k words

Part of our Performance & Scalability series



AI agent operational costs can scale from manageable to alarming surprisingly quickly. An agent processing 10 transactions per day is inexpensive. The same agent processing 5,000 transactions per day, with each transaction requiring 3-4 LLM calls with large context windows, can generate thousands of dollars in monthly API costs — costs that weren't in the original ROI model.

Cost optimization is not optional for production-scale AI deployments. It's the difference between an agent that delivers positive ROI and one that erodes it. This guide covers the practical strategies that reduce costs by 40-70% in typical OpenClaw deployments without compromising output quality.

Key Takeaways

  • Token optimization (prompt compression, context pruning) reduces API costs 25-40% with no quality loss
  • Semantic caching eliminates LLM calls for repeated or similar requests, reducing costs 30-60% in many workloads
  • Model routing uses cheap models for simple tasks and expensive models only when needed
  • Prompt caching (where available from providers) reduces input token costs for repetitive system prompts
  • Batch processing reduces per-call overhead for high-volume, non-time-sensitive workloads
  • Cost monitoring with per-workflow attribution identifies the most expensive agent behaviors
  • Streaming reduces time-to-first-token latency for user-facing agents without increasing total cost
  • A comprehensive cost optimization strategy typically reduces total LLM spend by 45-65% vs. unoptimized deployments

Understanding AI Agent Cost Drivers

Before optimizing costs, understand what drives them. LLM API costs are primarily based on token consumption:

Input tokens: Every token sent to the model costs money — the system prompt, the user message, the retrieved context (RAG chunks), conversation history, and any examples (few-shot). Input token costs are typically 2-5x lower than output token costs for current frontier models.

Output tokens: Tokens generated by the model in its response. Verbose outputs cost more. Reasoning steps (chain-of-thought) cost more than direct answers. Structured JSON outputs cost more than prose if the JSON has many fields.

Call volume: Every LLM call has a minimum cost. Multi-step agents that make 5 LLM calls per task cost 5x more than single-call agents — but may produce much better results. The key is eliminating unnecessary calls.

Model selection: The cost difference between models is enormous. Claude 3 Haiku costs roughly 1/50th as much per token as Claude 3 Opus; GPT-4o mini costs roughly 1/15th as much as GPT-4o. Using a frontier model for every task is the most common source of unnecessary cost.

A realistic cost scenario:

Agent processes 1,000 customer service tickets per day. Each ticket requires:

  • System prompt: 800 tokens
  • Retrieved context: 1,200 tokens
  • Ticket content: 400 tokens
  • Total input: 2,400 tokens
  • Response: 600 tokens

Using Claude 3.5 Sonnet ($3/M input, $15/M output):

  • Daily cost: 1,000 × [(2,400 × $3/M) + (600 × $15/M)] = $16.20/day = $486/month

With optimization (shown in this guide), this drops to $150-$200/month — a 60% reduction.
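The scenario arithmetic can be checked with a small helper (the rates are the Sonnet prices quoted above; the function and its name are illustrative):

```python
def daily_llm_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Daily API cost in dollars for a fixed per-request token profile."""
    per_request = (input_tokens * input_price_per_m +
                   output_tokens * output_price_per_m) / 1_000_000
    return requests_per_day * per_request

# The ticket scenario above: 1,000 tickets/day, 2,400 input / 600 output tokens,
# Claude 3.5 Sonnet at $3/M input and $15/M output.
cost = daily_llm_cost(1_000, 2_400, 600, 3.0, 15.0)
print(f"${cost:.2f}/day, ${cost * 30:.0f}/month")  # → $16.20/day, $486/month
```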


Prompt Compression and Token Reduction

System Prompt Optimization

System prompts are sent with every request. If a bloated 2,000-token system prompt could be compressed to 800 tokens without information loss, you are paying 2.5x more than necessary for that component of input cost.

Techniques:

Remove redundancy: Review your system prompts for information that's restated in multiple places. Consolidate.

Use compressed language: Avoid conversational preamble. Compare:

Verbose (47 tokens): "You are a helpful assistant that is skilled at reviewing contracts. Your job is to carefully read through the contract and identify any clauses that might represent a risk to our company."

Compressed (23 tokens): "You are a contract risk analyst. Identify clauses representing risk to the client company."

The compressed version conveys identical instructions. LLMs respond to semantic content, not word count.

Use structured formatting: Numbered lists and bullet points convey information more densely than paragraphs.

Remove examples from system prompt when using few-shot: If you have examples in both the system prompt and the user message, you're paying for them twice. Consolidate to one location.

Audit system prompt length regularly: System prompts tend to grow as teams add instructions over time without removing outdated ones. A quarterly review typically finds 20-30% of system prompt content can be removed or compressed.

Context Window Management

RAG (Retrieval Augmented Generation) retrievals are one of the largest cost drivers for knowledge-intensive agents. Each retrieved chunk is input tokens. Unoptimized RAG frequently retrieves more context than needed.

Chunk size optimization: Smaller chunks (256-512 tokens) retrieved in higher quantities often outperform large chunks (1,000+ tokens) for factual question answering. Smaller chunks are also cheaper because irrelevant passages within a large chunk aren't retrieved.

Retrieval count tuning: If your agent retrieves 10 chunks per query but consistently uses information from only the top 2-3, reduce the retrieval count. Monitor which retrieved chunks are actually referenced in agent outputs.

Relevance filtering: Apply a relevance score threshold — only include retrieved chunks above the threshold in the context. Chunks with low relevance add cost without improving quality.
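A minimal sketch of relevance filtering, assuming the retriever attaches a relevance score to each chunk (the "score" field name, threshold, and cap are illustrative):

```python
def filter_chunks(chunks: list[dict], threshold: float = 0.75,
                  max_chunks: int = 3) -> list[dict]:
    """Keep only retrieved chunks above the relevance threshold, capped at max_chunks.

    Each chunk is assumed to carry a 'score' from the retriever (higher = more relevant).
    """
    relevant = [c for c in chunks if c["score"] >= threshold]
    relevant.sort(key=lambda c: c["score"], reverse=True)
    return relevant[:max_chunks]

chunks = [
    {"text": "Returns accepted within 30 days.", "score": 0.91},
    {"text": "Shipping is free over $50.", "score": 0.52},
    {"text": "Refunds issued to original payment method.", "score": 0.84},
]
print(filter_chunks(chunks))  # only the two chunks above 0.75, highest score first
```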

Conversation history pruning: For multi-turn agents, conversation history grows with each turn. Older turns are often less relevant. Implement a summarization strategy: after 8-10 turns, summarize early conversation into a compressed summary (200-300 tokens) rather than retaining full turn-by-turn history.

def manage_conversation_history(messages: list, max_tokens: int = 2000) -> list:
    """Prune conversation history to stay within a token budget.

    Assumes count_tokens() and summarize_conversation() helpers are available
    (e.g. tiktoken for counting, a cheap model for summarization).
    """
    if count_tokens(messages) <= max_tokens:
        return messages

    # Summarize everything between the system message and the last
    # 3 user/assistant exchanges (6 messages)
    early_messages = messages[1:-6]
    summary = summarize_conversation(early_messages)

    return [
        messages[0],  # system message, always kept verbatim
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]"},
        *messages[-6:],  # last 3 exchanges kept verbatim
    ]

Semantic Caching

Semantic caching is the highest-impact cost optimization for agents handling repetitive queries. It stores the result of LLM calls and returns cached results for subsequent requests that are semantically similar — even if not identical.

How Semantic Caching Works

  1. When an LLM call is made, compute an embedding vector for the input (prompt + context)
  2. Search the cache for stored results with high vector similarity to the current input
  3. If similarity exceeds the threshold, return the cached result (no LLM call)
  4. If not, make the LLM call and store the result with its embedding

The critical insight: many real-world requests are semantically similar even when not textually identical. "What is the return policy for orders placed in the last 30 days?" and "Can I return something I ordered 3 weeks ago?" are different words but the same question — semantic caching can serve the second from the cache of the first.
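The four steps above can be sketched as a minimal in-memory cache. The embedding and LLM calls are injected as plain functions here; a production system would use a vector database for the similarity search in step 2:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class SemanticCache:
    """Minimal in-memory semantic cache following the four steps above.

    embed_fn and llm_fn are injected callables; any embedding model and
    LLM client work. This sketch omits TTLs and cache scoping.
    """
    def __init__(self, embed_fn, llm_fn, threshold: float = 0.95):
        self.embed_fn, self.llm_fn, self.threshold = embed_fn, llm_fn, threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, cached result)

    def query(self, prompt: str) -> str:
        vec = self.embed_fn(prompt)  # step 1: embed the input
        # steps 2-3: return the best cached result if it clears the threshold
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        # step 4: cache miss — call the LLM and store the result
        result = self.llm_fn(prompt)
        self.entries.append((vec, result))
        return result
```

With this structure, two differently worded but semantically equivalent questions produce one LLM call and one cache hit.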

Cache Hit Rate by Agent Type

| Agent Type | Expected Cache Hit Rate | Rationale |
|---|---|---|
| FAQ / customer support | 50-75% | Common questions repeat frequently |
| Data lookup (product info, pricing) | 40-65% | Same products queried repeatedly |
| Document classification | 30-50% | Similar document types appear repeatedly |
| Report narrative generation | 20-40% | Trends are similar across periods |
| Custom workflow orchestration | 5-15% | Each case is highly unique |
| Data analysis | 10-25% | Questions are varied but some repeat |

For a customer support agent with a 65% cache hit rate, semantic caching reduces LLM call volume — and therefore LLM cost — by 65%.

Cache Configuration

Similarity threshold: The threshold for declaring two requests "similar enough" for cache reuse. Higher threshold = fewer cache hits but higher accuracy. Lower threshold = more cache hits but risk of returning subtly wrong answers for dissimilar requests.

For factual queries, a similarity threshold of 0.92-0.95 is typically safe. For analytical or reasoning tasks, use a higher threshold (0.97+) to avoid returning incorrect analysis for subtly different questions.

Cache TTL: Different cache entry types should have different expiration periods:

  • Product pricing: 1-4 hours (prices change)
  • Policy information: 24-48 hours (policies rarely change)
  • General knowledge: 7 days (very stable information)
  • Generated reports: Cache until the underlying data changes (event-triggered invalidation)

Cache scope: Configure whether the cache is per-user, per-organization, or global. Customer support agents should have organization-scoped caches (an answer appropriate for your organization may not be appropriate for another). General knowledge agents can share a global cache.


Model Routing and Tiered LLM Selection

Not every task requires a frontier model. Using GPT-4o or Claude 3.5 Sonnet for a simple classification task that GPT-4o mini handles correctly is paying 15-50x more than necessary.

Routing Strategy

Task complexity classification: Implement a lightweight classifier that categorizes each incoming request by complexity:

  • Simple: Lookup, classification with few categories, short generation with clear template
  • Moderate: Multi-step reasoning, extraction from complex documents, conditional logic
  • Complex: Open-ended analysis, creative synthesis, nuanced judgment

Model assignment:

  • Simple → GPT-4o mini, Claude 3 Haiku (cost: ~$0.15-0.30/M tokens)
  • Moderate → Claude 3.5 Sonnet, GPT-4o (cost: ~$3-5/M tokens)
  • Complex → Claude 3.5 Sonnet, GPT-4o (or o1 for deep reasoning tasks) (cost: $5-15/M tokens)

Fallback routing: If the cheaper model produces output below quality threshold (detected by automated evaluation), retry with the more expensive model. This "cascade" approach uses cheap models optimistically and only escalates when needed.

def route_to_model(task: AgentTask) -> str:
    """Map classified task complexity to a model tier.

    Assumes classify_task_complexity() returns "simple" | "moderate" | "complex";
    model identifiers are abbreviated for readability.
    """
    complexity = classify_task_complexity(task)

    model_map = {
        "simple": "claude-3-haiku",
        "moderate": "claude-3-5-sonnet",
        "complex": "claude-3-5-sonnet",
    }
    return model_map[complexity]

def execute_with_fallback(task: AgentTask):
    """Cascade: try the routed (cheaper) model first, escalate on low quality."""
    primary_model = route_to_model(task)
    result = execute_with_model(task, primary_model)

    if not meets_quality_threshold(result):
        # Escalate to the more capable model
        result = execute_with_model(task, "claude-3-5-sonnet")

    return result

Realistic savings from model routing: In a mixed-workload agent fleet, 60-70% of tasks typically qualify as "simple." Routing these to cheap models achieves 50-70% cost reduction on that segment, translating to 30-50% overall cost reduction.


Prompt Caching (Provider-Level)

Anthropic and OpenAI offer prompt caching features that reduce the cost of repeated system prompts. When the system prompt (or any prefix of the prompt) is identical across multiple requests, cached tokens cost significantly less than fresh tokens.

Anthropic cache pricing: Cached input tokens cost ~10% of standard input token price ($0.30/M vs. $3/M for Sonnet). Cache write cost is $3.75/M (written once, then read at $0.30/M).

Effective strategy: Structure prompts so the stable portion (system prompt, examples, instructions) comes first and the variable portion (user input, retrieved context) comes last. The provider caches the stable prefix automatically.
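With Anthropic's Messages API, the stable prefix is marked with a cache_control field. A sketch of the request structure (shown as payload construction only; the model id, prompt text, and helper are illustrative, and actually sending it requires the anthropic client and an API key):

```python
STABLE_SYSTEM_PROMPT = "You are a contract risk analyst. Identify clauses representing risk."

def build_request(user_input: str, retrieved_context: str) -> dict:
    """Stable content first (cacheable), variable content last."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache everything up to here
            }
        ],
        "messages": [
            # variable portion: retrieved context and user input come last
            {"role": "user", "content": f"{retrieved_context}\n\n{user_input}"}
        ],
    }

# client.messages.create(**build_request(ticket_text, rag_chunks)) would then pay
# the cache-write price once and the cache-read price on subsequent calls.
```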

Break-even calculation: Cache write costs 1.25x standard input token price; cache read costs 0.1x. Break-even is at 2 requests that share the prefix. Every request beyond the second is 90% cheaper for the cached portion.

For an agent with a 1,000-token system prompt running 1,000 requests per day:

  • Without caching: 1,000 × 1,000 tokens × $3/M = $3/day input cost for system prompt alone
  • With caching: one 1,000-token cache write (1,000 × $3.75/M ≈ $0.004) + 999 × 1,000 × $0.30/M ≈ $0.30/day
  • Daily savings: $2.70 (90% reduction on this component)

Batch Processing

For non-time-sensitive workloads (overnight report generation, batch document processing, scheduled data analysis), batch API calls offer significant cost reductions.

OpenAI Batch API: 50% cost reduction for requests submitted as batches with 24-hour completion windows. For overnight report generation, this alone halves the LLM API cost.

Anthropic Message Batches: Similar batch pricing for non-time-sensitive workloads.

Batch scheduling patterns:

  • Collect report generation requests throughout the day, submit as batch at end-of-business
  • Process document ingestion for RAG during off-peak hours as batch jobs
  • Run compliance monitoring scans nightly as batches
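OpenAI's Batch API accepts a JSONL file where each line is one request. A sketch of assembling that file for the end-of-day report batch (the JSONL fields follow OpenAI's documented batch format; the report structure and model choice are illustrative, and submission itself needs the openai client's files.create plus batches.create with completion_window="24h"):

```python
import json

def build_batch_lines(reports: list[dict]) -> str:
    """Build the JSONL body for an OpenAI Batch API submission.

    Each line is one request; custom_id lets you match results back to reports.
    """
    lines = []
    for report in reports:
        lines.append(json.dumps({
            "custom_id": report["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": report["prompt"]}],
            },
        }))
    return "\n".join(lines)
```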

Cost Monitoring and Attribution

Optimization requires knowing where costs are coming from. Implement cost monitoring from the first day of production:

Per-workflow cost tracking: Tag every LLM call with the workflow it belongs to. Calculate total cost per workflow per day. This reveals which agent behaviors are most expensive and prioritizes optimization effort.

Per-token attribution: Break down costs by input vs. output tokens, by prompt component (system prompt vs. context vs. user input), and by model. Cost attribution at this granularity enables targeted optimization.

Cost anomaly detection: Alert when daily costs spike more than 20% above the rolling 7-day average. Spikes indicate either legitimate volume increases (expected) or bugs (infinite loops, runaway context windows, prompt injection causing unusually long completions).
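The spike rule above can be sketched as a simple check over a trailing list of daily cost totals (function name and input shape are illustrative):

```python
def cost_spike(daily_costs: list[float], threshold: float = 0.20) -> bool:
    """Flag today's cost if it exceeds the rolling 7-day average by more than threshold.

    daily_costs: trailing daily totals in dollars, most recent last (today included).
    """
    if len(daily_costs) < 8:
        return False  # not enough history for a 7-day baseline
    baseline = sum(daily_costs[-8:-1]) / 7  # the 7 days before today
    return daily_costs[-1] > baseline * (1 + threshold)
```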

Cost per successful task: Divide total costs by successful task completions to get cost per unit of value. This is the metric that matters for ROI — if cost per task decreases while task volume and quality hold, optimization is working.


Frequently Asked Questions

How much can cost optimization realistically reduce LLM API costs?

In typical OpenClaw deployments, a systematic optimization effort addressing prompt compression, semantic caching, and model routing achieves 45-65% cost reduction compared to unoptimized deployments. The specific savings depend heavily on workload characteristics — agents with highly repetitive queries benefit most from caching; agents with diverse, unique queries benefit more from model routing.

Does semantic caching compromise response accuracy?

With proper threshold configuration, the accuracy impact is negligible — typically less than 0.5% degradation on factual tasks. The key is setting the similarity threshold appropriately for the task type. For tasks where subtle differences in the question lead to different correct answers, use higher similarity thresholds (0.96+) to ensure only truly equivalent queries are served from cache.

What is the latency impact of semantic caching?

Cache lookups (vector similarity search) add 5-15ms latency. Cache hits eliminate the LLM call latency (typically 500ms-3s). Net result: cached responses are 20-200x faster than non-cached responses. This is a latency improvement, not a degradation.

How do we implement cost monitoring without significant engineering effort?

OpenClaw's observability layer captures token counts and model selections for every execution automatically. ECOSIRE configures a cost dashboard during implementation that shows costs by workflow, model, and time period. No custom engineering is required — the monitoring infrastructure is part of the standard implementation.

At what scale do cost optimization measures become worthwhile?

Most optimization measures become worthwhile above $500/month in LLM API costs. Below that threshold, the engineering effort typically exceeds the savings. Above $2,000/month, systematic optimization is strongly recommended — the ROI on engineering time invested in optimization is very high at this scale.

Does switching to cheaper models compromise the quality of agent outputs?

For tasks where cheaper models genuinely provide equivalent quality, switching to them is pure savings. For tasks requiring deep reasoning, nuanced judgment, or complex synthesis, cheaper models produce noticeably worse outputs. The model routing pattern addresses this by using cheaper models only where they're appropriate and routing to premium models for tasks that require them. The key is empirical validation — test the cheaper model on your specific task before routing production traffic to it.


Next Steps

Cost optimization for AI agents is an ongoing discipline, not a one-time project. ECOSIRE's OpenClaw implementations include a cost optimization layer from day one — semantic caching, model routing, and prompt optimization are built into the deployment architecture rather than added as afterthoughts.

Explore ECOSIRE OpenClaw Services to discuss your cost optimization requirements, or review our maintenance and optimization retainer options to understand how ECOSIRE manages ongoing cost efficiency for production OpenClaw deployments.


Written by

ECOSIRE Research and Development Team

Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integrations, e-commerce automation, and AI-powered business solutions.
