Step 3: optimize LLM spend with benchmarks and phased roll-out

At the end of this step, you should have answers to the following questions:

Which use case is the first optimization target, and why?
What are the success criteria? (Minimum acceptable performance + primary metric)
What did the tests reveal? (Which configuration won, and by how much?)

0. Intro

You’ve identified where your money goes (Step 2) and what you’re optimizing for (Step 1). Now comes the fun part: finding cheaper, faster, or better configurations.

But here’s the trap. Most teams see a 90% cost reduction in a blog post, swap their model, and immediately regret it. Quality tanks. Users complain. They roll back and assume “cheaper models don’t work.” That’s not true. What doesn’t work is blind changes without testing.

The right approach is systematic: test one variable at a time, measure the impact, keep what works. Always optimize in this order:

Prompt engineering (free, fast, often dramatic impact)
Model selection (where the biggest cost savings live)
Parameter tuning (the final 10-20% of improvement)

Let’s break down each step.

1. Optimize your prompt first

Before you change the model or parameters, squeeze every drop of performance from your prompt. Small changes can have massive impact, and they cost nothing.

1.1 Be specific about format

❌ “Classify this email”

✅ “Classify this email. Return only: spam, urgent, or normal”
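A format constraint pays off most when the calling code enforces it. The following is a minimal sketch, assuming hypothetical helper names (`build_prompt`, `parse_label`) and leaving out the model call itself; it rejects any reply that is not one of the three labels.

```python
from typing import Optional

# The three labels from the constrained prompt above.
ALLOWED_LABELS = {"spam", "urgent", "normal"}

def build_prompt(email_text: str) -> str:
    """Build the constrained classification prompt for one email."""
    return (
        "Classify this email. Return only: spam, urgent, or normal\n\n"
        f"Email:\n{email_text}"
    )

def parse_label(raw_output: str) -> Optional[str]:
    """Normalize a model reply; return None if it is not an allowed label."""
    label = raw_output.strip().strip(".\"'").lower()
    return label if label in ALLOWED_LABELS else None
```

Counting how often `parse_label` returns `None` gives a simple measure of how well a model follows the format.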

1.2 Constrain output length

❌ “Summarize this article”

✅ “Summarize this article in exactly 3 bullet points, max 15 words each”
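Whether a model actually respects a length constraint is easy to check automatically. A small sketch, assuming the model writes each bullet on its own line starting with `-`, `*`, or `•` (the function name and the bullet markers are illustrative assumptions):

```python
def check_summary(text: str, bullets: int = 3, max_words: int = 15) -> bool:
    """Return True if text has exactly `bullets` bullet lines of at most `max_words` words."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != bullets:
        return False
    for line in lines:
        if not line.startswith(("-", "*", "•")):
            return False
        # Drop the bullet marker before counting words.
        words = line.lstrip("-*• ").split()
        if len(words) > max_words:
            return False
    return True
```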

1.3 Use structured output

❌ “Extract the customer’s name, email, and issue”

✅ “Return JSON: {"name": string, "email": string, "issue": string}”
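Structured output can be validated before anything downstream uses it. A minimal sketch using only the standard library; `parse_ticket` is a hypothetical helper name, and the three fields are the ones from the prompt above.

```python
import json
from typing import Optional

REQUIRED_FIELDS = ("name", "email", "issue")

def parse_ticket(raw_output: str) -> Optional[dict]:
    """Parse a model reply as JSON and check that each required field is a string."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # the reply was not valid JSON at all
    if not isinstance(data, dict):
        return None
    if any(not isinstance(data.get(field), str) for field in REQUIRED_FIELDS):
        return None  # a field is missing or has the wrong type
    # Keep only the fields we asked for.
    return {field: data[field] for field in REQUIRED_FIELDS}
```

The share of replies that fail this check is a useful quality metric when comparing prompts or models.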

1.4 Remove unnecessary instructions

Don’t say “You are a helpful assistant” if it doesn’t affect output
Don’t ask for explanations if you only need the answer
Don’t request markdown formatting if plain text works

Don’t over-optimize blindly. Test every prompt change. Sometimes verbosity helps. Sometimes “explain your reasoning” actually improves accuracy. Let data decide.
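One way to let data decide is to score every prompt variant against the same labeled examples. The sketch below assumes each variant is wrapped in a function that takes input text and returns the model’s answer; the names `accuracy` and `compare_prompts` are illustrative, and the model call is not shown.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected answer)

def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of labeled examples that predict() answers correctly."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    return correct / len(examples)

def compare_prompts(
    variants: Dict[str, Callable[[str], str]],
    examples: Sequence[Example],
) -> Dict[str, float]:
    """Score every prompt variant on the same examples."""
    return {name: accuracy(predict, examples) for name, predict in variants.items()}
```

Because every variant sees identical inputs, a difference in score can be attributed to the prompt change rather than to the data.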

2. Test different models

Prompt optimization alone might have saved you enough to call it a win, and that’s perfectly valid. But if you’re ready to push further, model testing is where the really dramatic savings hide. Just know: this is also where quality can slip if you’re not careful. The secret: there are over 300 models available, and most teams only try 3 to 5.

2.1 Ignore the leaderboards

Before you start testing, ignore everything you’ve read about model benchmarks. Here’s the uncomfortable truth. Benchmarks don’t predict real-world performance on your use case. There are hundreds of benchmarks measuring model intelligence:

Massive Multitask Language Understanding (MMLU) (general knowledge)
HumanEval (code generation)
Graduate-Level Google-Proof Q&A (GPQA) (graduate-level reasoning)
HellaSwag (commonsense reasoning)
TruthfulQA (factual accuracy)
BIG-Bench Hard (BBH) (reasoning challenges)
MT-Bench (multi-turn conversations)
and 200+ more

What do these benchmarks actually measure? How well models perform on the benchmarks themselves. That’s it. They don’t measure:

Performance on your prompts
Performance on your data distribution
Performance on your edge cases
Cost efficiency for your use case
Latency in your infrastructure

A model that scores 94% on MMLU might be terrible at classifying your support tickets. A model that ranks #47 on the leaderboard might be perfect for generating your product descriptions. The correlation between benchmark scores and real-world performance on specific tasks is weak at best.

Why benchmarks mislead

1. They’re not your data. Benchmarks test on curated datasets:

Academic questions with clear right answers
Sanitized inputs with no typos or edge cases
English-only (usually), when your users might write in Spanglish

Your real data is messy. Users make typos. They write run-on sentences. They reference context you need to infer.

2. They’re not your prompts. Benchmarks use standardized prompts optimized for the test. Your production prompts are different: you’ve tuned them for your specific use case, added company context, and constrained the output format.

3. Gaming is rampant. Models are increasingly trained on benchmark data. A model that scores 96% on HumanEval isn’t necessarily a better coder; it might just have seen those exact problems during training.

4. They ignore what you care about. Does the benchmark measure:

Token efficiency? (No. Verbosity is fine in benchmarks)
Output consistency? (No. One correct answer is enough)
Latency? (No. Time doesn’t matter)
Cost per successful outcome? (No. Accuracy alone wins)

But these metrics determine whether a model actually works for your business.

The only benchmark that matters is your benchmark. Test on your data, with your prompts, measuring your metrics. Everything else is marketing.
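Cost per successful outcome, one of the metrics listed above, is simple to compute once you count how many outputs were actually usable. A minimal sketch; the function name and the numbers in the example are hypothetical and only illustrate the arithmetic.

```python
def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Total spend divided by the number of outputs that met the quality bar."""
    if successes <= 0:
        return float("inf")  # nothing usable was produced
    return total_cost_usd / successes

# Hypothetical numbers for 1,000 requests each, for illustration only.
cheap_model = cost_per_success(total_cost_usd=1.00, successes=500)   # 50% usable
strong_model = cost_per_success(total_cost_usd=1.50, successes=950)  # 95% usable
```

In this made-up case the model with the higher total bill is the cheaper one per usable output, which is why price per token alone is not enough to pick a winner.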

2.2 Models to consider

OpenAI uses GPT (Generative Pre-trained Transformer) in product names such as GPT-4 and GPT-4o-mini. Don’t just test the obvious choices (GPT-4, Claude, Gemini). Explore:

Lightweight models from major providers:

GPT-4o-mini, GPT-4.1-nano
Claude Haiku, Claude Sonnet
Gemini Flash, Gemini Flash-8B

Open-source models via API:

Llama 3.1 (8B, 70B, 405B)
Mixtral 8x7B, Mixtral 8x22B
Qwen2.5 (various sizes)
Command R, Command R+

Specialized models:

Anthropic models for analysis and reasoning
Cohere for classification and embeddings
OpenAI o1 for complex reasoning tasks

2.3 Testing approach

For the use case you prioritized in Step 2, run parallel tests on your actual data.

Example: product description generator

Run 1,000 real product titles through:

GPT-4o (baseline: $15/1M output tokens)
Claude Sonnet ($15/1M output tokens)
GPT-4o-mini ($0.60/1M output tokens)
Claude Haiku ($1.25/1M output tokens)
Llama 3.1 70B ($0.88/1M output tokens)
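To see what these prices mean for a test run, multiply requests by output tokens by the per-token price. The sketch below uses the output prices listed above and ignores input-token costs; the average of 150 output tokens per description in the test is an assumption for illustration, not a measured value.

```python
# Output prices in USD per 1M tokens, taken from the list above.
PRICE_PER_1M_OUTPUT_USD = {
    "GPT-4o": 15.00,
    "Claude Sonnet": 15.00,
    "GPT-4o-mini": 0.60,
    "Claude Haiku": 1.25,
    "Llama 3.1 70B": 0.88,
}

def output_cost_usd(model: str, requests: int, avg_output_tokens: float) -> float:
    """Output-token cost only; input tokens are billed separately and not included."""
    return requests * avg_output_tokens * PRICE_PER_1M_OUTPUT_USD[model] / 1_000_000

def savings_vs_baseline(model: str, baseline: str = "GPT-4o") -> float:
    """Fractional saving on output-token price relative to the baseline model."""
    return 1 - PRICE_PER_1M_OUTPUT_USD[model] / PRICE_PER_1M_OUTPUT_USD[baseline]
```

These figures only matter for models that also pass your quality bar, so treat them as one input to the comparison rather than the result.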

Measure what actually matters for your business:

Quality: Manual review of 100 samples, or automated evaluation against your rubric
Cost: Actual tokens consumed × model pricing
Latency: P50, P95, P99 response times
Consistency: Do outputs vary wildly, or are they stable?
Edge case handling: How does it perform on your weird/broken/unusual inputs?
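Latency percentiles can be computed from the response times you log during the test. A minimal sketch using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values); the function names are illustrative.

```python
import math
from typing import Dict, Sequence

def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """P50, P95, and P99 for a list of response times in milliseconds."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

P95 and P99 need enough samples to be meaningful; with fewer than 100 measurements, P99 is simply the slowest request.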

Typical outcome:

2-3 models meet your quality bar
The cheapest acceptable model is 60-95% cheaper than your current choice
One model is surprisingly good (often one you’ve never heard of)
The leaderboard winner might not even crack your top 3

Cast a wide net. That obscure model ranked #47 on the leaderboard might be perfect for your use case. The #1 model might be overkill. You won’t know until you test on your data.
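Once each model has a quality score and a cost, picking the winner is a filter followed by a minimum. A sketch under stated assumptions: the `Result` fields, the model names, and the idea of a single numeric quality score are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Result:
    model: str
    quality: float                   # share of outputs passing your rubric, 0.0 to 1.0
    cost_per_1k_requests_usd: float  # measured during the test run

def cheapest_acceptable(results: List[Result], min_quality: float) -> Optional[Result]:
    """Return the lowest-cost model that meets the quality bar, or None if none does."""
    passing = [r for r in results if r.quality >= min_quality]
    if not passing:
        return None
    return min(passing, key=lambda r: r.cost_per_1k_requests_usd)
```

Setting `min_quality` before looking at the results keeps the decision tied to the success criteria rather than to whichever model looks cheapest.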

3. Tune parameters for the final edge

After you’ve picked the right prompt and model, squeeze out the last 10-20% with parameter tuning.

Key parameters and their impact

Parameter	Range	When to increase	When to decrease	Impact on cost	Impact on quality
Temperature	0.0 - 2.0 (varies by provider)	Need creativity, variety, brainstorming	Need consistency, factual accuracy	Indirect - can change output length	High - controls randomness
Max tokens	1 - model output limit	Outputs often truncated	Getting unnecessarily long outputs	Direct - fewer tokens = lower cost	Medium - truncation can hurt quality
Top P	0.0 - 1.0	Want more diverse vocabulary	Want more predictable outputs	None	Medium - controls word choice diversity
Frequency penalty	-2.0 - 2.0	Model repeats itself too much	Outputs feel unnatural or disjointed	None	Low - reduces repetition
Presence penalty	-2.0 - 2.0	Want model to explore new topics	Want model to stay focused	None	Low - encourages topic diversity
Stop sequences	Custom strings	Want to truncate at specific markers	Model stops too early	Direct - early stopping = fewer tokens	Low - mostly for formatting

Parameters interact. Changing temperature affects output length, which affects cost. Test configurations as a whole, not in isolation.
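Because parameters interact, it can help to enumerate complete configurations and evaluate each one as a unit. A small sketch that builds such a grid; `temperature`, `max_tokens`, and `top_p` are the parameter names used by several provider APIs, but check your provider’s documentation, and the function name is an assumption.

```python
from itertools import product
from typing import Dict, Iterable, List

def config_grid(
    temperatures: Iterable[float],
    max_tokens_options: Iterable[int],
    top_ps: Iterable[float] = (1.0,),
) -> List[Dict[str, float]]:
    """Every combination of the given values, each as one complete configuration."""
    return [
        {"temperature": t, "max_tokens": m, "top_p": p}
        for t, m, p in product(temperatures, max_tokens_options, top_ps)
    ]
```

Keep the grid small: two or three values per parameter already multiply into many test runs.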

4. Common pitfalls to avoid

4.1 Optimizing without measurement

You can’t know if you’ve improved without baselines.
Set up tracking (Step 2) before you optimize.

4.2 Changing too many variables at once

If you change prompt + model + parameters simultaneously, you won’t know what worked.
Test one variable at a time.

4.3 Testing on toy datasets

10 examples won’t tell you how the model behaves at scale.
Use at least 100-500 real samples, ideally production traffic.
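A confidence interval makes the sample-size point concrete. The sketch below computes a Wilson score interval for a measured pass rate; choosing the Wilson method and a 95% level (z = 1.96) is an assumption made here for illustration.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of successes / n (95% level by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

With 9 of 10 samples correct, the interval spans roughly 60% to 98%; with 450 of 500 correct, it narrows to roughly 87% to 92%. The measured rate is 90% in both cases, but only the larger sample tells you much.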

4.4 Ignoring edge cases

Your model might work great on average but fail catastrophically on 1% of inputs.
Test the weird stuff, not just the happy path.
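An average can hide a failing category, so it helps to report accuracy per input slice. A minimal sketch, assuming each test record has been tagged with a slice name such as "typical" or "empty_input" (the tags and the function name are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_slice(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """records are (slice_name, is_correct) pairs; returns accuracy per slice."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_name, is_correct in records:
        totals[slice_name][0] += int(is_correct)
        totals[slice_name][1] += 1
    return {name: correct / total for name, (correct, total) in totals.items()}
```

A test set that is 98% correct overall can still score 0% on a rare slice, and that slice is often where user complaints come from.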

4.5 Deploying winners too fast

Models behave differently under load.
Always do gradual releases with monitoring.
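A gradual release can be as simple as routing a fixed percentage of users to the new configuration. The sketch below uses deterministic hash bucketing so that each user stays in the same group as the percentage grows; the salt and the model names are placeholders, and this is one possible approach rather than a prescribed one.

```python
import hashlib

def in_rollout(user_id: str, percent: int, salt: str = "model-rollout-v1") -> bool:
    """Deterministically place a user in one of 100 buckets; the first `percent` buckets get the new model."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

def pick_model(user_id: str, percent: int,
               candidate: str = "candidate-model",
               current: str = "current-model") -> str:
    """Route a request to the candidate or the current model."""
    return candidate if in_rollout(user_id, percent) else current
```

Raising `percent` from 5 to 25 to 100 only ever adds users to the candidate group, so you can compare monitoring data between stages without users flipping back and forth.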

4.6 Stopping after one optimization

You’ve got 5-10 use cases burning money.
Build a rhythm: optimize one use case per month.

5. You’ve completed the framework

If you’ve followed all three steps, you now have:

Clear objectives (Step 1) - You know what you’re optimizing for and who makes decisions
Cost visibility (Step 2) - You know where every dollar goes and who owns it
Optimization wins (Step 3) - You’ve proven you can cut costs 40-90% without sacrificing quality

This is a competitive advantage. While other teams burn through budgets on inefficient infrastructure, you’re delivering better experiences for a fraction of the cost. The efficiency gap compounds. Keep optimizing.

Want to move faster? Narev eliminates the tedious parts—routing, testing, monitoring, deployment. Sign up for the free tier and optimize your first use case today.
