Utilix knowledge base
Prompt Token Optimization — How to Reduce AI API Bills
Published May 3, 2026
Token usage directly determines API cost. These techniques reduce token consumption without meaningful loss in output quality.
1. Trim Your System Prompt
System prompts are attached to every request. A 500-token system prompt on 10,000 daily requests adds 5 million input tokens per day, roughly $12.50 per day (about $375 per month) at GPT-4o input pricing of $2.50 per million tokens.
Audit your system prompt for:
- Repetition (the same instruction stated multiple ways)
- Examples that are never needed (move to user turn instead)
- Long caveats that rarely affect output
A well-edited system prompt of 100–200 tokens often performs as well as a 500-token one.
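The first step of the audit is simply measuring what the prompt costs. A minimal sketch using the tiktoken library (`o200k_base` is the encoding used by GPT-4o); the file path and the $2.50-per-million input price are assumptions to adapt:

```python
import tiktoken

# o200k_base is the tokenizer used by GPT-4o and GPT-4o mini.
enc = tiktoken.get_encoding("o200k_base")

system_prompt = open("system_prompt.txt").read()  # your actual prompt
tokens = len(enc.encode(system_prompt))

requests_per_day = 10_000
price_per_million = 2.50  # assumed GPT-4o input price, USD per 1M tokens

daily_cost = tokens * requests_per_day / 1_000_000 * price_per_million
print(f"{tokens} tokens -> ${daily_cost:.2f}/day spent on the system prompt alone")
```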
2. Truncate Conversation History
Passing the full conversation history with each request is the most common cause of runaway costs in chat applications. As a conversation grows, the input tokens per request grow linearly with its length, so the cumulative cost of the conversation grows quadratically.
Strategies:
- Keep only the last N turns (e.g., last 5 exchanges)
- Summarize older turns into a compact paragraph and prepend that instead
- Store structured facts extracted from the conversation rather than raw transcript
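A minimal sketch of the first two strategies, assuming OpenAI-style message dicts with "role" and "content" keys; the `summarize` argument stands in for whatever summarization call (typically a budget-model request) you plug in:

```python
def truncate_history(messages, keep_turns=5):
    """Strategy 1: keep the system prompt plus the last N exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # One turn = a user message plus the assistant reply.
    return system + rest[-keep_turns * 2:]

def compact_history(messages, summarize, keep_turns=5):
    """Strategy 2: summarize older turns, keep recent ones verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    older, recent = rest[:-keep_turns * 2], rest[-keep_turns * 2:]
    if not older:
        return system + recent
    note = {"role": "user",
            "content": "Summary of earlier conversation: " + summarize(older)}
    return system + [note] + recent
```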
3. Choose the Right Model for Each Task
Not every task needs the most capable model. Split your workload by complexity:
| Task | Recommended tier |
|---|---|
| Classification, intent detection | GPT-4o mini / Gemini Flash |
| Summarization, Q&A | GPT-4o mini / Claude Haiku |
| Complex reasoning, code review | GPT-4o / Claude Sonnet |
| Research synthesis, nuanced writing | Claude Sonnet / GPT-4o |
Running 80% of requests on the budget tier and 20% on the mid-tier can cut average per-request cost by 50–70% compared with sending everything to the mid-tier.
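In code, routing can be as simple as a lookup table. A sketch; the task labels and tier mapping are assumptions you would adapt to your own workload (model names here are OpenAI's, but the pattern is provider-agnostic):

```python
# Map task type to a model tier. The task label itself can come from
# a rules engine, a keyword match, or a budget-model classification call.
MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "summarization": "gpt-4o-mini",
    "qa": "gpt-4o-mini",
    "code_review": "gpt-4o",
    "reasoning": "gpt-4o",
}

def pick_model(task_type: str) -> str:
    # Default to the budget tier: a cheap mistake can be retried
    # on a stronger model, but an expensive default cannot be refunded.
    return MODEL_BY_TASK.get(task_type, "gpt-4o-mini")
```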
4. Use Structured Output Formats
Asking the model to return clean JSON instead of narrative prose reduces output tokens and parsing effort:
Instead of: "The sentiment is positive because the user mentioned..."
Ask for: {"sentiment": "positive", "confidence": 0.92}
Structured formats also reduce the need for follow-up correction requests.
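One concrete way to enforce this is OpenAI's JSON mode via the `response_format` parameter. A sketch; the prompt wording and example input are illustrative:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    # JSON mode constrains the model to emit valid JSON.
    # Note: the prompt must mention JSON or the API rejects the request.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": 'Classify sentiment. Reply with JSON: {"sentiment": ..., "confidence": ...}'},
        {"role": "user", "content": "I love this product, works perfectly!"},
    ],
)
print(resp.choices[0].message.content)
# e.g. {"sentiment": "positive", "confidence": 0.95}
```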
5. Cache Repeated Context
If your application passes the same large document, code file, or knowledge base excerpt with every request, context caching can reduce input costs significantly:
- Anthropic offers prompt caching: cache reads are billed at about 10% of the base input price (a 90% discount), with a small premium on cache writes and a minimum cacheable prefix of 1,024 tokens on most models
- Google offers context caching for Gemini, with discounted input tokens plus a per-token-per-hour storage charge
- OpenAI applies caching automatically: repeated prompt prefixes over 1,024 tokens get a discount on the cached portion, with no code changes required
The setup is a few lines of code and pays for itself immediately on high-volume applications.
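With Anthropic's Python SDK, for example, the change is a single `cache_control` marker on the static prefix. A sketch; the file path, question, and model alias are illustrative:

```python
import anthropic

client = anthropic.Anthropic()
big_document = open("knowledge_base.md").read()  # the context you resend every request

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": big_document,
            # Mark the prefix cacheable; subsequent requests that reuse
            # this exact prefix read it from cache at a fraction of base price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does section 3 say about refunds?"}],
)
print(resp.content[0].text)
```

Cache hits require the prefix to match exactly, so place stable content (documents, instructions) first and variable content (the user's question) last.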
6. Write Concise User Prompts
Users often write verbose prompts when a shorter one would work equally well. If your application reformats or preprocesses user input, extract the key intent rather than passing raw text.
A 200-word user message might be reducible to a 20-word task description with no loss in output quality.
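One pattern is to compress verbose input with a budget model before handing it to the expensive one. A sketch under the assumption that the compressed prompt feeds a much pricier model or is reused across many requests (otherwise the compression call costs more than it saves); the instruction wording is something you would tune for your domain:

```python
from openai import OpenAI

client = OpenAI()

def compress_prompt(raw_text: str) -> str:
    """Reduce a verbose user message to a short task description."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # budget tier: compression must cost less than it saves
        messages=[
            {"role": "system",
             "content": ("Rewrite the user's message as a task description of "
                         "25 words or fewer. Preserve all concrete requirements; "
                         "drop pleasantries and backstory.")},
            {"role": "user", "content": raw_text},
        ],
    )
    return resp.choices[0].message.content
```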
7. Batch Where Possible
Some providers (notably OpenAI) offer a batch API at a 50% discount for requests that do not need a real-time response; results are returned within 24 hours. If you are running bulk jobs (document processing, data extraction, evaluation), the batch endpoint halves your cost.
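A minimal sketch of the batch workflow with OpenAI's SDK; the file name, documents, and prompt are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one request per line in JSONL format.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(["first document...", "second document..."]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
            },
        }) + "\n")

# 2. Upload the file and create the batch job.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
# Poll client.batches.retrieve(batch.id) until status == "completed",
# then download the output file to collect the responses.
```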
Quick Reference
| Technique | Typical savings |
|---|---|
| Trim system prompt by 50% | 5–15% on input cost |
| Truncate conversation history | 10–40% depending on conversation length |
| Downgrade to budget model tier | 50–80% total |
| Context caching (Anthropic/Google) | Up to 90% on repeated content |
| Batch API (OpenAI) | 50% flat |
Start with the model choice — it has the largest impact. Then address conversation history if you are building a chat application.