Documentation Index
Fetch the complete documentation index at: https://narev.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
0. Intro
You’ve identified where your money goes (Step 2) and what you’re optimizing for (Step 1). Now comes the fun part: finding cheaper, faster, or better configurations. But here’s the trap. Most teams see a 90% cost reduction in a blog post, swap their model, and immediately regret it. Quality tanks. Users complain. They roll back and assume “cheaper models don’t work.” That’s not true. What doesn’t work is blind changes without testing. The right approach is systematic: test one variable at a time, measure the impact, keep what works. Always optimize in this order:- Prompt engineering (free, fast, often dramatic impact)
- Model selection (where the biggest cost savings live)
- Parameter tuning (the final 10-20% of improvement)
1. Optimize your prompt first
Before you change the model or parameters, squeeze every drop of performance from your prompt. Small changes can have massive impact, and they cost nothing.1.1 Be specific about format
❌ “Classify this email”
✅ “Classify this email. Return only: spam, urgent, or normal”
1.2 Constrain output length
❌ “Summarize this article”
✅ “Summarize this article in exactly 3 bullet points, max 15 words each”
1.3 Use structured output
❌ “Extract the customer’s name, email, and issue”
✅ “Return JSON: {"name": string, "email": string, "issue": string}"
1.4 Remove unnecessary instructions
- Don’t say “You are a helpful assistant” if it doesn’t affect output
- Don’t ask for explanations if you only need the answer
- Don’t request markdown formatting if plain text works
2. Test different models
Prompt optimization alone might have saved you enough to call it a win - and that’s perfectly valid. But if you’re ready to push further, model testing is where the really dramatic savings hide. Just know: this is also where quality can slip if you’re not careful. The secret. There are over 300 models available. Most teams only try 3 to 5.2.1 Ignore the leaderboards
Before you start testing, ignore everything you’ve read about model benchmarks. Here’s the uncomfortable truth. Benchmarks don’t predict real-world performance on your use case. There are hundreds of benchmarks measuring model intelligence:- Massive Multitask Language Understanding (MMLU) (general knowledge)
- HumanEval (code generation)
- Graduate Question Answering (GPQA) (graduate-level reasoning)
- HellaSwag (commonsense reasoning)
- TruthfulQA (factual accuracy)
- Big Bench Hard (BBH) (reasoning challenges)
- MT-Bench (multi-turn conversations)
- and 200+ more
- Performance on your prompts
- Performance on your data distribution
- Performance on your edge cases
- Cost efficiency for your use case
- Latency in your infrastructure
Why benchmarks mislead
1. They’re not your data Benchmarks test on curated datasets:- Academic questions with clear right answers
- Sanitized inputs with no typos or edge cases
- English-only (usually), when your users might write in Spanglish
- Token efficiency? (No. Verbosity is fine in benchmarks)
- Output consistency? (No. One correct answer is enough)
- Latency? (No. Time doesn’t matter)
- Cost per successful outcome? (No. Accuracy alone wins)
2.2 Models to consider
OpenAI uses Generative Pre-trained Transformer (GPT) in product names such as GPT-4 and GPT-4o-mini. Don’t just test the obvious choices (GPT-4, Claude, Gemini). Explore: Lightweight models from major providers:- GPT-4o-mini, GPT-4.1-nano
- Claude Haiku, Claude Sonnet
- Gemini Flash, Gemini Flash 8 B
- Llama 3.1 (8 B, 70 B, 405 B)
- Mistral 8x7 B, Mistral 8x22 B
- Qwen2.5 (various sizes)
- Command R, Command R+
- Anthropic models for analysis and reasoning
- Cohere for classification and embeddings
- OpenAI o1 for complex reasoning tasks
2.3 Testing approach
For the use case you prioritized in Step 2, run parallel tests on your actual data: Example: product description generator Run 1,000 real product titles through:- GPT-4o (baseline: $15/1M output tokens)
- Claude Sonnet ($15/1M output tokens)
- GPT-4o-mini ($0.60/1M output tokens)
- Claude Haiku ($1.25/1M output tokens)
- Llama 3.1 70 B ($0.88/1M output tokens)
- Quality: Manual review of 100 samples, or automated evaluation against your rubric
- Cost: Actual tokens consumed × model pricing
- Latency: P50, P95, P99 response times
- Consistency: Do outputs vary wildly, or are they stable?
- Edge case handling: How does it perform on your weird/broken/unusual inputs?
- 2-3 models meet your quality bar
- The cheapest acceptable model is 60-95% cheaper than your current choice
- One model is surprisingly good (often one you’ve never heard of)
- The leaderboard winner might not even crack your top 3
3. Tune parameters for the final edge
After you’ve picked the right prompt and model, squeeze out the last 10-20% with parameter tuning.Key parameters and their impact
| Parameter | Range | When to increase | When to decrease | Impact on cost | Impact on quality |
|---|---|---|---|---|---|
| Temperature | 0.0 - 2.0 | Need creativity, variety, brainstorming | Need consistency, factual accuracy | None | High - controls randomness |
| Max tokens | 1 - ∞ | Outputs often truncated | Getting unnecessarily long outputs | Direct - fewer tokens = lower cost | Medium - truncation can hurt quality |
| Top P | 0.0 - 1.0 | Want more diverse vocabulary | Want more predictable outputs | None | Medium - controls word choice diversity |
| Frequency penalty | -2.0 - 2.0 | Model repeats itself too much | Outputs feel unnatural or disjointed | None | Low - reduces repetition |
| Presence penalty | -2.0 - 2.0 | Want model to explore new topics | Want model to stay focused | None | Low - encourages topic diversity |
| Stop sequences | Custom strings | Want to truncate at specific markers | Model stops too early | Direct - early stopping = fewer tokens | Low - mostly for formatting |
4. Common pitfalls to avoid
4.1 Optimizing without measurement
You can’t know if you’ve improved without baselines Set up tracking (Step 2) before you optimize4.2 Changing too many variables at once
If you change prompt + model + parameters simultaneously, you won’t know what worked Test one variable at a time4.3 Testing on toy datasets
10 examples won’t tell you how the model behaves at scale Use at least 100-500 real samples, ideally production traffic4.4 Ignoring edge cases
Your model might work great on average but fail catastrophically on 1% of inputs Test the weird stuff, not just the happy path4.5 Deploying winners too fast
Models behave differently under load Always do gradual releases with monitoring4.6 Stopping after one optimization
You’ve got 5-10 use cases burning money Build a rhythm: optimize one use case per month5. You’ve completed the framework
If you’ve followed all three steps, you now have:- Clear objectives (Step 1) - You know what you’re optimizing for and who makes decisions
- Cost visibility (Step 2) - You know where every dollar goes and who owns it
- Optimization wins (Step 3) - You’ve proven you can cut costs 40-90% without sacrificing quality