RAG vs. Fine-Tuning vs. Prompt Engineering: Stop Picking the Wrong One

Ayra ix · 9 min read

A decision framework for people who actually need results

Technique Overview

Technique	What it is	Cost	Latency	Best use	Fatal flaw
RAG	Retrieval-Augmented Generation	$50–500/mo (vector DB + embeddings)	+50–200ms retrieval overhead	Frequently updated knowledge, private docs, multi-source Q&A	Wrong chunks = confident wrong answer. Invisible without chunk logging.
Fine-Tuning	Supervised Model Training	$5k–50k+ per training run + update cycles	No retrieval overhead at inference	Consistent format at scale, offline inference, stable domain style	Stale the moment your domain shifts. Update cycle cost compounds.
Prompt Engineering	Zero Infra, Zero Training	$0 setup + inference tokens only	Baseline model latency, no overhead	Everything, until it genuinely fails. Start here, always.	Dismissed too early. "Prompt engineering" ≠ one-line system prompt.

A team I know spent 3 months and $40,000 fine-tuning a model to answer questions about their internal product documentation. They hired a contractor, stood up the training pipeline, ran evals, deployed it. The result performed at roughly the same level as a well-prompted base model with the documentation pasted directly into the context window. By the time they ran that comparison, the documentation had been updated twice, so the fine-tuned model was stale before it shipped.

The $40k Fine-Tuning Failure — Cost Breakdown

Item	Cost	Notes
Contractor scoping + data audit	$4,200	2 weeks; defined the "right" approach as fine-tuning before testing alternatives
Training data preparation	$8,500	Q&A pair extraction, cleaning, and formatting of ~2,000 examples
GPU training runs (×4 iterations)	$6,800	Cloud GPU hours + failed runs; each eval required a full retrain
Evaluation and iteration	$9,000	No fixed eval set defined upfront, so each round redefined what "good" meant
Deployment + integration	$7,500	Hosting, endpoint infra, rollout, all for a model that was already stale
Opportunity cost (team time)	$14,000	Engineering hours diverted from roadmap features for 3 months

Total: ~$50,000 (about $36,000 in direct spend plus $14,000 in team time) | Comparable baseline (prompted GPT-4 + docs in context): ~$200/mo
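To make the arithmetic behind that total explicit, here is a small sketch that sums the line items and compares the result with the ~$200/mo baseline. The figures are copied from the breakdown; treating the baseline as a flat monthly cost is a simplifying assumption.

```python
# Line items copied from the cost breakdown above (USD).
line_items = {
    "Contractor scoping + data audit": 4_200,
    "Training data preparation": 8_500,
    "GPU training runs (x4 iterations)": 6_800,
    "Evaluation and iteration": 9_000,
    "Deployment + integration": 7_500,
    "Opportunity cost (team time)": 14_000,
}

# Assumption: the prompted baseline stays at a flat ~$200 per month.
BASELINE_MONTHLY = 200

total = sum(line_items.values())
direct = total - line_items["Opportunity cost (team time)"]
months_of_baseline = total / BASELINE_MONTHLY

print(f"Direct spend: ${direct:,}")
print(f"Total with opportunity cost: ${total:,}")
print(f"Months of baseline the total would fund: {months_of_baseline:.0f}")
```

At these numbers the total would fund the prompted baseline for about 250 months, which is why running the cheap comparison first matters.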

The engineers weren't incompetent. They made a technique decision before they understood the problem. That's the failure — not the fine-tuning itself, but picking it first, before ruling out the cheaper options. I've seen the same pattern often enough that I now treat it as the default outcome when technique selection happens in the wrong order.

The right order is: prompt engineering, then RAG, then fine-tuning. Not because the techniques form a neat hierarchy, but because each step only makes sense after the previous one has genuinely failed. Most problems stop at step one.

The Decision You're Actually Making

Before reaching for any technique, three questions matter: How often does the knowledge your system needs change? How consistent does the output format need to be at scale? And what's the real cost of updating when requirements shift?

If knowledge changes faster than you can reasonably retrain — product docs, legal updates, internal KB — that's RAG's domain, not fine-tuning's. If you need the same output structure reproduced identically across 50,000 inferences a day with no context overhead, fine-tuning earns its cost. If neither of those is clearly true, you haven't tried prompt engineering hard enough yet.

Decision Flowchart — Where to Start

Decision Matrix

Scenario	Best Approach	Why	Red Flags
Knowledge updates monthly (docs, policies, KB)	RAG	Re-embedding chunks is fast and cheap. Retraining a model is not. You'd ship a stale model before the first update cycle completes.	Choosing fine-tuning for "better accuracy" — you're paying to memorize data that will be wrong in 30 days.
Consistent output format at high volume (>50k inferences/day)	Fine-Tune	Eliminates per-request few-shot overhead. Bakes the schema into weights. Saves tokens at scale if format is truly stable.	Domain isn't actually stable. If requirements shift quarterly, the update cycle cost will exceed the savings in 6 months.
Exploratory / prototyping / early product	Prompt Eng.	Zero infra commitment. Immediate iteration. You don't know your exact requirements yet — committing to a technique before you do is the $40k mistake.	Jumping to RAG or fine-tuning because it "feels more serious." Prompt engineering is the baseline everything else is measured against.
Private/sensitive data that can't leave org	RAG + local model	Embed privately, retrieve locally, generate on-prem. No sensitive data ever hits an external training pipeline or shared API endpoint.	Cloud fine-tuning APIs — your training data goes to a third-party training pipeline, even with enterprise agreements.
Budget under $1k total	Prompt Eng.	Fine-tuning costs are prohibitive at this budget. Even a minimal RAG setup adds infra overhead. Squeeze everything from a well-crafted system prompt first.	Any vendor who tells you fine-tuning is the only path for your use case without asking about your update cycle. That's not advice, that's a sale.
High volume inference, stable domain, offline required	Fine-Tune	No retrieval latency, no vector DB dependency, no context window overhead. The ROI calculation actually holds when all three conditions are true simultaneously.	Any one of the three conditions being false. Volume alone, or stability alone, doesn't justify fine-tuning. All three together do.

The fast filter: Dynamic or frequently updated knowledge → RAG. Consistent style/format at volume → fine-tuning. Anything else → prompt engineering first. If the problem is ambiguous across those dimensions, that's a sign to start with prompt engineering and treat it as the baseline everything else gets measured against.
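The fast filter can be written down as a few lines of code. This is only a sketch of the ordering described here: the `UseCase` fields and the `recommend` function are hypothetical names, and the boolean inputs stand in for judgments you would have to make yourself.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    knowledge_changes_often: bool       # docs, policies, KB that update frequently
    needs_fixed_format_at_volume: bool  # same output structure across many inferences
    domain_is_stable: bool              # requirements unlikely to shift soon
    prompt_baseline_failed: bool        # a serious prompt attempt fell short on the eval set


def recommend(use_case: UseCase) -> str:
    """Apply the fast filter, with prompt engineering as the baseline."""
    # Nothing else is justified until the prompt baseline has genuinely failed.
    if not use_case.prompt_baseline_failed:
        return "prompt engineering"
    if use_case.knowledge_changes_often:
        return "RAG"
    if use_case.needs_fixed_format_at_volume and use_case.domain_is_stable:
        return "fine-tuning"
    # Ambiguous cases go back to the baseline.
    return "prompt engineering"


print(recommend(UseCase(True, False, False, True)))
```

The first branch is the one that matters most: it keeps the other two techniques off the table until there is a measured failure to point at.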

The $40k mistake wasn't made because the team was reckless. It was made because "fine-tuning" sounds like the serious, production-grade choice — and "prompt engineering" sounds like a workaround. That framing is exactly backwards. Prompt engineering is the foundation. The other two are extensions you bolt on when the foundation can't carry the full load.

RAG's Silent Failure Mode

RAG is the right call when knowledge freshness matters more than anything else. You embed your documents, store them in a vector index, retrieve the relevant chunks at query time, and pass them to the model as context. The model reasons over current information without ever being trained on it. Simple in principle.

The failure mode nobody talks about enough: wrong chunks retrieved. The model gets semantically adjacent but factually incorrect context, generates a confident answer from it, and you have no signal that anything went wrong. The answer looks fine. It reads well. It's wrong. And you don't find out until a user notices — which they often don't, because wrong answers that sound authoritative are hard to catch without domain knowledge.

Why Retrieval Failures Are Invisible

Most RAG evaluation measures answer quality — fluency, relevance, coherence. Retrieval quality is almost never evaluated separately. So you end up with a pipeline where you can verify the model is generating good text from whatever context it gets, but you have no idea whether the context is right. A model that's excellent at reasoning over bad chunks is a confident hallucination machine. Fix retrieval first. Everything downstream of bad retrieval is noise.

I debugged exactly this class of problem when building my personal memory system — the Qdrant-backed brain I wrote about in the AI Brain post. The symptom was false positives: a query about one topic pulling chunks from a semantically similar but contextually wrong document. The retrieval scores looked fine. The answers looked reasonable. Only when I started logging which chunks were actually retrieved on each query did I see that the pipeline was routinely confident about wrong context.
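A minimal sketch of what chunk logging and a separate retrieval check might look like. The hit fields (`id`, `score`, `source`), the chunk ids, and the hand-labelled `relevant_ids` are all made up for illustration; this is not the logging code from my own system.

```python
import io
import json


def log_retrieval(query, hits, sink):
    """Write one JSON line per query recording which chunks came back."""
    record = {
        "query": query,
        "chunks": [
            {"id": h["id"], "score": h["score"], "source": h["source"]} for h in hits
        ],
    }
    sink.write(json.dumps(record) + "\n")
    return record


def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the known-relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Example with invented chunk ids; in practice `sink` would be an open log file.
sink = io.StringIO()
hits = [
    {"id": "pricing-03", "score": 0.81, "source": "pricing.md"},
    {"id": "billing-11", "score": 0.79, "source": "billing.md"},
]
log_retrieval("How is usage billed?", hits, sink)
score = recall_at_k(
    [h["id"] for h in hits], relevant_ids=["billing-11", "billing-12"], k=5
)
print(score)
```

Both similarity scores look healthy here, yet only one of the two relevant chunks was retrieved. That gap is invisible if you only read the final answer.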

The variables that matter for retrieval quality in practice:

Chunk size: Smaller chunks surface precise facts but lose context. Larger chunks preserve context but dilute the embedding signal. There's no universal right answer — it depends on whether queries are factual lookups or contextual questions. The mistake is picking one and never testing the other.
Hybrid search: Keyword plus semantic consistently outperforms pure semantic in any domain with precise terminology. If your documents contain model numbers, policy codes, API names, or specific identifiers, pure semantic search will miss exact matches that keyword search trivially finds. Run both and merge the results.
Embedding model alignment: Your query embeddings and your document embeddings need to come from the same embedding model (or at least from models trained to share an embedding space), and your queries should resemble the kind of text that model handles well. Mismatches produce retrieval failures that are genuinely hard to diagnose: the vectors look like they should match, the cosine similarities are reasonable, but the retrieved content is consistently slightly wrong in ways that don't produce obvious errors. This is the failure mode that eats weeks of debugging time.
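The hybrid search point says to run both searches and merge the results, without naming a merge method. One common option is reciprocal rank fusion, sketched below. The constant `k=60` is a conventional default rather than a tuned value, and the chunk ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in
    (rank starts at 1), so chunks found by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = ["policy-AX-204", "policy-AX-210", "faq-07"]
semantic_hits = ["faq-07", "policy-AX-204", "overview-01"]
merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(merged)
```

Because the fusion works on ranks rather than raw scores, it sidesteps the problem that keyword scores and cosine similarities are not on the same scale.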

When Fine-Tuning Is Actually the Answer

Fine-tuning earns its complexity when the problem requires consistent style or format at a scale where context-stuffing is impractical, or when the task requires offline inference with no retrieval infrastructure. Those are real use cases. They're also narrower than the fine-tuning hype suggests.

The test I use: if a well-crafted system prompt with 5–10 few-shot examples gets you 80% of the behavior you need, fine-tuning will get you to 90% — at the cost of a full retraining cycle every time requirements change. That tradeoff is sometimes worth it. More often it isn't, because requirements change more often than teams expect when they're scoping the project.

The real cost of fine-tuning isn't the training run. It's the update cycle: data preparation, training, evaluation, deployment — every time your domain shifts. For a product documentation use case where the docs update monthly, that's not a pipeline you want. For a task with genuinely stable output requirements and high volume, it makes sense.
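The update-cycle argument is easy to check with rough numbers. The sketch below is a back-of-the-envelope model, not a costing tool: every dollar figure in the example is hypothetical, and it assumes savings and update costs are spread evenly across the year.

```python
def annual_update_cost(cost_per_cycle, cycles_per_year):
    """Yearly spend on data prep, training, eval, and deploy for every refresh."""
    return cost_per_cycle * cycles_per_year


def months_to_break_even(upfront_cost, monthly_savings, yearly_update_cost):
    """Months until token savings repay the initial training spend, or None if never."""
    net_monthly = monthly_savings - yearly_update_cost / 12
    if net_monthly <= 0:
        return None
    return upfront_cost / net_monthly


# Hypothetical figures: $20k initial run, $3k/mo saved on tokens, $6k per update cycle.
quarterly = months_to_break_even(20_000, 3_000, annual_update_cost(6_000, 4))
monthly = months_to_break_even(20_000, 3_000, annual_update_cost(6_000, 12))
print(quarterly, monthly)
```

With quarterly updates the example pays back in 20 months. With monthly updates it never does, because the update cycle costs more than the savings it protects.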

"The right order is always: prompt engineering → RAG → fine-tuning. Most problems stop at step one or step two. Fine-tuning is for the last 10%, and it costs the most to update."

What a System Prompt That Actually Works Looks Like

The "prompt engineering" that fails isn't prompt engineering — it's adding a one-sentence system prompt and calling it done. "You are a helpful assistant. Answer questions about our product." That's not a system prompt. That's a placeholder.

The structure I use across every pipeline I've built:

1. Role with scope constraints. Not just "you are an X" — but "you are an X who only handles Y, declines Z, and always does W." The constraint is the useful part. Without it, the model will helpfully drift into adjacent territory and produce answers that are technically plausible but operationally wrong.

2. Domain rules, not domain facts. Don't stuff the system prompt with facts the model already knows. Use it to encode the rules that are specific to your context: what terminology means in your domain, what the edge cases are, what the model should do when it hits ambiguity. Facts belong in retrieval. Rules belong in the system prompt.

3. Output format with a real example. "Respond in JSON" is insufficient. Show the exact schema. Show a filled-in example. The difference in output consistency between "respond in JSON" and "respond in this exact JSON format: [example]" is substantial and immediate.

4. Explicit failure handling. Tell the model what to do when it doesn't know. If you don't, it will hallucinate by default. "If the information needed to answer this question is not in the provided context, say so and explain what's missing" — that one line eliminates the majority of confident wrong answers from RAG pipelines.
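Put together, the four parts might be assembled like this. It is a sketch under assumptions: `build_system_prompt` is a hypothetical helper, and the Acme billing product, its rules, and the JSON schema are invented for the example.

```python
import json

FAILURE_RULE = (
    "If the information needed to answer this question is not in the provided "
    "context, say so and explain what's missing."
)


def build_system_prompt(role, handles, declines, rules, example_output):
    """Assemble scoped role, domain rules, a format example, and failure handling."""
    lines = [
        f"You are {role}.",
        f"You only handle: {', '.join(handles)}.",
        f"You decline: {', '.join(declines)}.",
        "",
        "Domain rules:",
    ]
    lines += [f"- {rule}" for rule in rules]
    lines += [
        "",
        "Respond in this exact JSON format:",
        json.dumps(example_output, indent=2),
        "",
        FAILURE_RULE,
    ]
    return "\n".join(lines)


prompt = build_system_prompt(
    role="a support assistant for the Acme billing product",
    handles=["invoice questions", "plan limits"],
    declines=["legal advice", "questions about other products"],
    rules=[
        '"Seat" means a paid user licence, not a login.',
        "When two policies conflict, cite both and do not pick one.",
    ],
    example_output={"answer": "Seats are billed monthly.", "sources": ["billing.md#seats"]},
)
print(prompt)
```

Keeping the prompt in a function like this also makes it easy to diff and test, since each part is an explicit argument rather than a sentence buried in a string.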

Version control your system prompt like production code. Test changes against a fixed evaluation set. The teams that treat prompt iteration as a casual process tend to find that "the AI got worse last week" — and have no idea when or why it changed. That's a tooling failure, not an AI failure.
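One lightweight way to answer "when did it change" is to tag every eval run with a fingerprint of the exact prompt text. The helpers below are hypothetical and the pass counts are invented; a git commit hash would serve the same purpose as the SHA-256 prefix used here.

```python
import hashlib


def prompt_fingerprint(prompt_text):
    """Short, stable id for an exact prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def record_run(history, prompt_text, passed, total):
    """Append one eval result, tagged with the prompt version that produced it."""
    entry = {"version": prompt_fingerprint(prompt_text), "passed": passed, "total": total}
    history.append(entry)
    return entry


def first_regression(history):
    """Return the first run whose pass rate dropped versus the run before it, else None."""
    for prev, cur in zip(history, history[1:]):
        if cur["passed"] / cur["total"] < prev["passed"] / prev["total"]:
            return cur
    return None


# Invented results against the same fixed 50-question eval set.
history = []
record_run(history, "You are a billing assistant. Answer only from context.", passed=46, total=50)
record_run(history, "You are a billing assistant.", passed=39, total=50)
culprit = first_regression(history)
print(culprit)
```

The point is less the hashing than the habit: every score is attached to a specific prompt version, so a drop can be traced to the edit that caused it.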

My Local RAG Stack

For local inference I run Qdrant for vector storage on my NAS at port 6333, nomic-embed-text via Ollama for embeddings, and Ollama for generation. Simple HTTP calls between components, no framework.

The specific reason I use nomic-embed-text over alternatives like all-MiniLM or bge-small: it was trained on a much larger and more diverse dataset, which means its embedding space handles domain-specific language better without fine-tuning the embedder itself. For a personal knowledge base that spans SAP documentation, architecture notes, and job research, a general-purpose model that generalizes well beats a specialized model that's slightly better on benchmarks but brittle on out-of-distribution inputs. It also runs fast on CPU — embedding 500 chunks takes seconds, not minutes.

Local RAG Stack — Self-Hosted

1. Query: user question or agent request via HTTP POST.

2. nomic-embed-text (Ollama :11434): embeds the query into a 768-dim vector; fast on CPU, no GPU needed.

3. Qdrant :6333 (knowledge base collection): cosine similarity search → top-k chunks (k=5 default); thousands of points indexed.

4. Context assembly: retrieved chunks injected into the prompt template with source metadata.

5. qwen2.5-coder:14b (Ollama :11434): generates an answer grounded in retrieved context; no hallucination on out-of-context queries.

6. Response: answer + source chunk references returned. No LangChain. Direct HTTP. Zero abstraction overhead.

The stack in one line: query → nomic-embed-text (Ollama) → Qdrant :6333 → top-k chunks → Ollama generation. No LangChain. No vector store abstraction layer. Direct HTTP. The overhead of a framework adds failure modes before you've solved the actual retrieval problem. Start simple, add complexity only after you understand where the baseline breaks.
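For readers who want to see the shape of those direct HTTP calls, here is a minimal sketch using only the standard library. It is not my actual code. The hostnames, the `knowledge_base` collection name, and the `text` and `source` payload keys are assumptions. The endpoints used are Ollama's `/api/embeddings` and `/api/generate` and Qdrant's `/collections/{name}/points/search`; request shapes can differ between versions, so check them against what you have installed.

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # assumption: only the port is given in the article
QDRANT = "http://localhost:6333"   # assumption: only the port is given in the article
COLLECTION = "knowledge_base"      # hypothetical collection name


def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


def answer(question, post=post_json, top_k=5):
    """Embed the query, retrieve top-k chunks, and generate a grounded answer."""
    # Embed the query with nomic-embed-text via Ollama.
    vector = post(
        f"{OLLAMA}/api/embeddings",
        {"model": "nomic-embed-text", "prompt": question},
    )["embedding"]

    # Nearest-neighbour search in Qdrant; the distance metric is set on the collection.
    hits = post(
        f"{QDRANT}/collections/{COLLECTION}/points/search",
        {"vector": vector, "limit": top_k, "with_payload": True},
    )["result"]

    # Assemble context with source labels. The payload keys are hypothetical.
    context = "\n\n".join(
        f"[{hit['payload'].get('source', hit['id'])}] {hit['payload'].get('text', '')}"
        for hit in hits
    )
    prompt = (
        "Answer using only the context below. If the context does not contain "
        "the answer, say so and explain what's missing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Generate, then return the answer along with the ids of the chunks used.
    reply = post(
        f"{OLLAMA}/api/generate",
        {"model": "qwen2.5-coder:14b", "prompt": prompt, "stream": False},
    )["response"]
    return {"answer": reply, "sources": [hit["id"] for hit in hits]}
```

Calling `answer("How are seats billed?")` requires both services to be running. The `post` parameter exists so the HTTP layer can be swapped for a stub when testing the pipeline logic offline.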

Before You Commit to Any Technique

Technique Selection Sanity Check

Have you built a 20+ question evaluation set before touching any infrastructure? If not, you don't know what "good" looks like yet — and you can't tell whether a technique is working.

If considering RAG: are you logging retrieved chunks per query and evaluating retrieval quality independently from answer quality? If you're only measuring the final answer, you will miss retrieval failures until a user catches them.

If considering fine-tuning: have you modeled the update cycle cost — data prep, training, eval, deploy — for how often your requirements will realistically change over the next 12 months? That's the real cost, not the training run.

The Root Cause of the $40k Mistake

No evaluation set before the technique was chosen. The team defined success as "the model answers questions about our docs" — not "the model answers these 50 specific questions correctly, including these 10 edge cases." If they'd built that set first, they would have discovered in week 1 that a prompted base model with the docs in context cleared the bar. The evaluation set forces you to define what "good" actually means. Once you've done that, the technique choice usually becomes obvious — or you find that the problem is harder to specify than you thought, which is equally useful information before you spend $40k finding out the hard way.
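An evaluation set of that kind can start as a plain list of cases with the edge cases tagged. The sketch below uses substring matching as the pass criterion, which is a crude assumption that real sets usually outgrow; the questions and canned answers are invented.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    must_contain: str        # crude pass criterion: the answer mentions this string
    edge_case: bool = False


def pass_rates(cases, answer_fn):
    """Score any candidate (prompted model, RAG, fine-tune) on the same fixed cases."""
    results = [
        (case, case.must_contain.lower() in answer_fn(case.question).lower())
        for case in cases
    ]

    def rate(pairs):
        return sum(ok for _, ok in pairs) / len(pairs) if pairs else None

    return {
        "overall": rate(results),
        "edge_cases": rate([(c, ok) for c, ok in results if c.edge_case]),
    }


cases = [
    EvalCase("How are seats billed?", "monthly"),
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Can I get a refund after cancelling mid-cycle?", "prorated", edge_case=True),
]

# Canned answers stand in for a real model call.
canned = {
    "How are seats billed?": "Seats are billed monthly.",
    "What is the refund window?": "Refunds are available within 30 days.",
    "Can I get a refund after cancelling mid-cycle?": "I don't know.",
}
rates = pass_rates(cases, lambda question: canned[question])
print(rates)
```

Reporting the edge-case rate separately matters because a candidate can look fine overall while failing exactly the questions that motivated the project.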

Spend two hours on prompt engineering before anything else. If you're at 70%, add retrieval. If you're still short and the use case genuinely justifies the update cycle overhead, consider fine-tuning. That's the order. It's not exciting advice, but it's the one that doesn't end with a stale model in production and an invoice you can't explain.