Building RAG Pipelines

Assemble end-to-end RAG systems with query routing, re-ranking, and answer synthesis.

A production RAG system is more than embed-retrieve-generate. A robust pipeline includes query understanding, retrieval with re-ranking, context assembly, generation with grounding, and answer validation. Each stage has prompt engineering opportunities.

  1. Query understanding: Classify intent, expand query, extract filters
  2. Retrieval: Vector search + optional keyword search (hybrid retrieval)
  3. Re-ranking: Score retrieved chunks by relevance using a cross-encoder or LLM
  4. Context assembly: Select top chunks, order them, add metadata
  5. Generation: Prompt the LLM with assembled context + instructions
  6. Validation: Check for hallucination, verify citations, assess confidence
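The six stages above can be wired together as plain function calls. The following is a minimal sketch, not a reference implementation: `Chunk`, `run_pipeline`, and the five stage callables are hypothetical names introduced here for illustration, and each stage would be backed by a real retriever, re-ranker, or LLM call in practice.

```python
from dataclasses import dataclass
from typing import Callable

FALLBACK = "I don't have enough information."


@dataclass
class Chunk:
    """A retrieved passage plus the metadata needed for citations."""
    source: str
    text: str


def run_pipeline(
    question: str,
    understand: Callable[[str], str],
    retrieve: Callable[[str], list],
    rerank: Callable[[str, list], list],
    generate: Callable[[str, list], str],
    validate: Callable[[str, list], bool],
    top_k: int = 3,
) -> str:
    """Run the six stages in order; every stage is an injected (hypothetical) callable."""
    query = understand(question)           # 1. query understanding
    candidates = retrieve(query)           # 2. retrieval (vector, keyword, or hybrid)
    ranked = rerank(query, candidates)     # 3. re-ranking
    context = ranked[:top_k]               # 4. context assembly: keep the best chunks
    answer = generate(question, context)   # 5. generation
    if validate(answer, context):          # 6. validation
        return answer
    return FALLBACK
```

Passing the stages in as arguments keeps each one independently testable and swappable, which matters because each stage tends to be tuned separately.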

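Vector search excels at semantic matching but can miss exact keywords such as product IDs or error codes. Keyword search (BM25) catches exact matches but misses paraphrases. Hybrid retrieval runs both and merges the two ranked lists, typically with Reciprocal Rank Fusion (RRF), which scores each document by summing 1/(k + rank) over the lists it appears in. Hybrid retrieval often outperforms either method alone; gains of roughly 5-15% are commonly cited, though the exact figure depends on the corpus and the evaluation metric.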
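To make the merging step concrete, here is a short sketch of Reciprocal Rank Fusion. It assumes each retriever returns a list of document IDs ordered best-first; the function name is illustrative, and k = 60 is the conventional default for the RRF constant rather than a value taken from this lesson.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings of document IDs into one list.

    Each document earns 1 / (k + rank) from every ranking it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]    # hypothetical vector-search results
keyword_hits = ["c", "a", "d"]   # hypothetical BM25 results
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# → ['a', 'c', 'b', 'd']
```

RRF uses only rank positions, so it needs no score normalization between the vector and keyword retrievers, whose raw scores are not directly comparable.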

Initial retrieval is fast but imprecise. Re-ranking takes the top 20-50 retrieved chunks and scores each one against the query with a more powerful model, such as a cross-encoder that reads the query and the chunk together. This is slower per chunk but usually much more accurate, which is why it is applied only to a short candidate list. Tools like Cohere Rerank, or an LLM acting as a judge, can do this.
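The re-ranking step can be sketched as "score every candidate, then keep the best few". In the sketch below, `rerank` accepts any scoring function; `overlap_score` is a deliberately crude token-overlap stand-in, used here only so the example runs without a model. It is an assumption for illustration and is not how a cross-encoder or Cohere Rerank scores passages.

```python
def rerank(query, chunks, score_fn, top_n=5):
    """Score each candidate chunk against the query and keep the top_n."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    # Sort on the score only; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def overlap_score(query, chunk):
    """Toy scorer: fraction of query tokens that also appear in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


candidates = [
    "How to reset your password",
    "Billing FAQ",
    "Password reset error E42 explained",
]
print(rerank("reset password error", candidates, overlap_score, top_n=2))
# → ['Password reset error E42 explained', 'How to reset your password']
```

In a real pipeline, `score_fn` would wrap a cross-encoder or a re-ranking API call; the surrounding sort-and-truncate logic stays the same.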

The generation prompt is the most impactful part of the pipeline. Key principles: (1) instruct the model to only use provided context, (2) require source citations, (3) explicitly allow "I don't know" responses, (4) format context clearly with source labels.
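The four principles can be applied mechanically when the prompt is assembled. The sketch below is one possible way to do it, assuming chunks arrive as (source name, text) pairs already in ranked order; the function name and the exact rule wording are illustrative.

```python
def build_generation_prompt(question, chunks):
    """Assemble a grounded prompt from (source_name, text) pairs."""
    # Principle 4: label every chunk with its source so citations are possible.
    context = "\n\n".join(
        f"[Source: {name}]\n{text.strip()}" for name, text in chunks
    )
    rules = (
        # Principle 1: restrict the model to the provided context.
        "Answer the question using ONLY the context below.\n"
        # Principle 3: explicitly allow an "I don't know" response.
        '- If the context does not contain the answer, reply "I don\'t know."\n'
        # Principle 2: require source citations.
        "- Cite sources using [Source: document_name] after each claim.\n"
    )
    return f"{rules}\nContext:\n{context}\n\nQuestion: {question}"


print(build_generation_prompt(
    "What is the refund window?",
    [("policy.md", "Refunds are accepted within 30 days."),
     ("faq.md", "Contact support to start a refund.")],
))
```

Placing the question after the context is a design choice made here, not a rule from the lesson.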

A common RAG failure mode is the model ignoring the retrieved context and answering from its training data instead. Always include an explicit instruction such as: "Answer ONLY based on the provided context."

Prompt Templates

RAG Generation with Validation

Production-grade RAG generation prompt with grounding, citation, and conflict handling.

You are a helpful assistant. Answer the question using ONLY the context below. Follow these rules:
- If the context doesn't contain the answer, say "I don't have enough information."
- Cite sources using [Source: document_name] after each claim
- If sources conflict, note the discrepancy

Context:
[RETRIEVED AND RE-RANKED CHUNKS WITH SOURCE LABELS]

Question: [QUESTION]
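Because the template asks for citations in a fixed `[Source: document_name]` format, the validation stage can check them mechanically. The following is a rough sketch of such a check, assuming the set of source names passed to the prompt is known; `check_citations` and its return keys are hypothetical names. It verifies only that each cited source was actually provided, not that the claim is supported by that source.

```python
import re

# Matches citations in the template's format, e.g. "[Source: policy.md]".
CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+?)\s*\]")


def check_citations(answer, known_sources):
    """Report which cited sources were (or were not) in the supplied context."""
    cited = sorted(set(CITATION_PATTERN.findall(answer)))
    unknown = [name for name in cited if name not in known_sources]
    return {
        "cited": cited,
        "unknown": unknown,  # cited but never provided: likely hallucinated
        "grounded": bool(cited) and not unknown,
    }


report = check_citations(
    "Refunds take 30 days [Source: policy.md]. Shipping is free [Source: blog.md].",
    known_sources={"policy.md", "faq.md"},
)
print(report)
# → {'cited': ['blog.md', 'policy.md'], 'unknown': ['blog.md'], 'grounded': False}
```

An answer with no citations at all is also reported as not grounded, so a caller would need to treat the "I don't have enough information." refusal as a separate, valid case.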

LLM Re-Ranker

Uses an LLM as a re-ranker to improve retrieval precision before generation.

Given the query and the following passages, rate each passage's relevance on a scale from 0 to 10 and briefly explain why.

Query: [QUERY]

Passage 1: [CHUNK 1]
Passage 2: [CHUNK 2]
Passage 3: [CHUNK 3]

Return results sorted by relevance score (highest first).
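The template lists three passages, but a re-ranker usually sees however many candidates retrieval returned. A small helper can fill the template for any number of passages; this is a sketch with an illustrative function name, and it only builds the prompt text. Sending it to a model and parsing the reply are left out because the template does not fix an output format.

```python
def build_rerank_prompt(query, passages):
    """Fill the LLM re-ranker template for an arbitrary list of passages."""
    lines = [
        "Given the query and the following passages, rate each passage's "
        "relevance from 0-10 and briefly explain why.",
        "",
        f"Query: {query}",
        "",
    ]
    # Number passages from 1 so the model's scores can be mapped back to chunks.
    lines += [f"Passage {i}: {text}" for i, text in enumerate(passages, start=1)]
    lines += ["", "Return results sorted by relevance score (highest first)."]
    return "\n".join(lines)


print(build_rerank_prompt(
    "reset password error",
    ["How to reset your password", "Billing FAQ"],
))
```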

Key Takeaways

  • Production RAG has six stages: query understanding, retrieval, re-ranking, context assembly, generation, and validation
  • Hybrid retrieval (vector + keyword) often outperforms either alone, with commonly cited gains of 5-15% depending on the corpus and metric
  • Always instruct the generation model to answer only from provided context and cite sources