LLM API Cost Optimization: 7 Techniques to Cut Your Bill by 60%
A practical guide to LLM API pricing, plus proven techniques like prompt compression, model selection, caching, and batch processing to dramatically reduce your API spend — with code examples.
LLM APIs charge by the token — roughly one token per word (or ~1.5 Chinese characters). Many teams see their AI API bills explode after launch, but most of the waste is avoidable. Here are 7 proven techniques to bring costs down.
Understanding the Bill: Input + Output Tokens
Total cost = input tokens × input price + output tokens × output price
- Input: Everything you send to the model (system prompt + conversation history + current message)
- Output: The model’s response
Output tokens typically cost 3–4× more than input tokens, so limiting output length has the highest ROI.
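As a quick sanity check, here is a back-of-the-envelope sketch of that formula. The per-million-token prices are illustrative placeholders, not any provider's actual rates:

```python
# Illustrative prices only; substitute your provider's actual rates.
INPUT_PRICE_PER_M = 0.15    # $ per 1M input tokens (placeholder)
OUTPUT_PRICE_PER_M = 0.60   # $ per 1M output tokens (placeholder)

input_tokens = 1_200_000    # e.g. one day of traffic
output_tokens = 300_000

cost = (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
print(f"Estimated daily cost: ${cost:.2f}")  # ≈ $0.36 at these placeholder rates
```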
Tip 1: Choose the Right Model — Don’t Use a Sledgehammer for a Nail
The simplest way to save money: don’t use the most expensive model for simple tasks.
| Task type | Recommended model | Relative cost |
|---|---|---|
| Classification, keyword extraction | GPT-4o mini / Claude Haiku | 1× |
| General Q&A, summarization | GPT-4o mini | 1× |
| Complex reasoning, code generation | GPT-4o / Claude 3.5 Sonnet | ~15× |
| Most complex tasks | o1 / Claude Opus | ~50× |
Many use cases work perfectly with GPT-4o mini at 1/10th the cost of GPT-4o.
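One lightweight way to apply this is a small router that maps task types to models before each call. The tiers, model names, and the `pick_model` helper below are illustrative assumptions (using the same `client` object as the other examples in this article), not a fixed recipe:

```python
# Hypothetical routing table; the tier-to-model mapping is an illustration.
MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "summarization": "gpt-4o-mini",
    "code_generation": "gpt-4o",
    "complex_reasoning": "o1",
}

def pick_model(task_type: str) -> str:
    """Fall back to the cheapest model when the task type is unknown."""
    return MODEL_BY_TASK.get(task_type, "gpt-4o-mini")

response = client.chat.completions.create(
    model=pick_model("classification"),  # cheap model for a simple task
    messages=[{"role": "user", "content": "Label this ticket: 'My invoice is wrong'"}],
)
```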
Tip 2: Compress Your System Prompt
System prompts consume input tokens on every single request. Trim them ruthlessly:
```python
# ❌ Verbose (~80 tokens)
system = """
You are a highly professional customer service assistant. Your responsibility
is to answer user questions about our products. Please ensure your responses
are accurate, helpful, and maintain a friendly tone. If you don't know the
answer, say so directly — never fabricate information.
"""

# ✅ Concise (~20 tokens)
system = "Customer support. Answer product questions accurately. Say 'I don't know' when unsure."
```
Save 60 tokens per request × 100,000 daily requests = 180 million tokens per month.
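To measure the savings exactly rather than estimate them, the `tiktoken` library can count tokens. The `o200k_base` encoding below is the one used by the GPT-4o family; adjust it for your model:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by the GPT-4o family

def count_tokens(text: str) -> int:
    """Exact token count for a prompt string."""
    return len(enc.encode(text))

concise = "Customer support. Answer product questions accurately. Say 'I don't know' when unsure."
print(count_tokens(concise))  # exact token count for the concise prompt
```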
Tip 3: Limit Output Length
Tell the model explicitly to keep answers short:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    max_tokens=200,  # ← hard cap
)
```
Also reinforce this in the prompt: “Answer in under 100 words.” Using both together works best.
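Put together, a request might look like the sketch below; the 100-word instruction is just one way to phrase the limit:

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer in under 100 words."},
        {"role": "user", "content": "Explain what an API rate limit is."},
    ],
    max_tokens=200,  # hard cap as a safety net behind the prompt instruction
)
```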
Tip 4: Trim Conversation History
In multi-turn conversations, each request resends the full history, so cumulative token usage grows roughly quadratically with the number of turns. Trim it:
```python
def trim_history(messages, max_tokens=3000):
    """Keep the most recent messages within a token budget; always retain the system prompt."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    # Rough estimate: 4 chars ≈ 1 token
    total = sum(len(m["content"]) // 4 for m in others)
    while total > max_tokens and len(others) > 1:
        removed = others.pop(0)  # drop the oldest message
        total -= len(removed["content"]) // 4
    return system + others
```
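Then call it right before each request, assuming `messages` holds the running conversation:

```python
messages = trim_history(messages, max_tokens=3000)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
```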
Tip 5: Cache Repeated Requests
Don’t call the API for identical requests:
```python
import hashlib, json

_cache = {}

def cache_key(model, messages):
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()

def cached_completion(model, messages, **kwargs):
    key = cache_key(model, messages)
    if key in _cache:
        return _cache[key]  # free — no tokens consumed
    result = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    _cache[key] = result
    return result
For FAQ-style workloads, cache hit rates of 60%+ are common. Use Redis in production.
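A Redis-backed version of the same idea might look like the sketch below. It caches only the response text with a TTL rather than the full response object, reuses the `cache_key` function above, and the connection settings are placeholders:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # placeholder connection

def cached_completion_redis(model, messages, ttl=3600, **kwargs):
    key = cache_key(model, messages)   # reuse the key function above
    hit = r.get(key)
    if hit is not None:
        return hit                     # cached answer text; no tokens consumed
    result = client.chat.completions.create(model=model, messages=messages, **kwargs)
    text = result.choices[0].message.content
    r.setex(key, ttl, text)            # expire after `ttl` seconds
    return text
```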
Tip 6: Batch API for Non-realtime Tasks
For offline workloads (data analysis, bulk translation, etc.), the Batch API costs half the price of the real-time API:
```python
import json

# Prepare batch tasks
tasks = [
    {"custom_id": f"task-{i}", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": text}]}}
    for i, text in enumerate(texts_to_process)
]

# Write the tasks to a JSONL file and upload it
with open("batch_tasks.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
batch_file = client.files.create(file=open("batch_tasks.jsonl", "rb"), purpose="batch")

# Submit the batch (completes within 24h at a 50% discount)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```
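To collect the results once the batch finishes, poll the batch status and download the output file. This is a sketch of the standard OpenAI-style retrieval flow:

```python
import json, time

while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)  # batches can take up to 24h; poll sparingly

if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    for line in output.text.splitlines():  # one JSON result per task
        result = json.loads(line)
        print(result["custom_id"],
              result["response"]["body"]["choices"][0]["message"]["content"])
```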
Tip 7: Monitor Token Usage — Find the Real Culprits
Know where your money is going before you optimize:
```python
# Log usage after every request
usage = response.usage
print(f"This call: input={usage.prompt_tokens}, output={usage.completion_tokens}")

# Write to a database and aggregate by feature (log_usage is your own logging helper)
log_usage(
    feature="chat",
    model=model,
    input_tokens=usage.prompt_tokens,
    output_tokens=usage.completion_tokens,
)
```
Typically 20% of features consume 80% of tokens. Focus your optimization effort there.
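Once usage is logged, a small aggregation turns tokens into dollars per feature. This is a sketch with placeholder prices; plug in your provider's actual rates:

```python
from collections import defaultdict

# Placeholder prices in $ per 1M tokens; substitute your provider's real rates.
PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}

def cost_by_feature(rows):
    """rows: dicts like {"feature", "model", "input_tokens", "output_tokens"} from your usage log."""
    totals = defaultdict(float)
    for row in rows:
        in_price, out_price = PRICES[row["model"]]
        totals[row["feature"]] += (row["input_tokens"] / 1e6 * in_price
                                   + row["output_tokens"] / 1e6 * out_price)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # biggest spenders first
```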
Summary
| Technique | Expected savings |
|---|---|
| Downgrade model | 50–90% |
| Compress system prompt | 10–30% |
| Limit output length | 20–50% |
| Trim conversation history | 20–60% |
| Cache repeated requests | 30–60% |
| Batch API | 50% |
Combined, cutting costs by 60% overall is very achievable.
👉 Try NixAPI — transparent pricing, no monthly fees, free to start.
Reliable LLM API relay for OpenAI, Claude, Gemini, DeepSeek, Qwen, and Grok with ¥1 = $1 top-up