Module 2, Lesson 3

Building RAG Pipelines

Assemble end-to-end RAG systems with query routing, re-ranking, and answer synthesis.

A production RAG system is more than embed-retrieve-generate. A robust pipeline includes query understanding, retrieval with re-ranking, context assembly, generation with grounding, and answer validation. Each stage has prompt engineering opportunities.

Query understanding: Classify intent, expand query, extract filters
Retrieval: Vector search + optional keyword search (hybrid retrieval)
Re-ranking: Score retrieved chunks by relevance using a cross-encoder or LLM
Context assembly: Select top chunks, order them, add metadata
Generation: Prompt the LLM with assembled context + instructions
Validation: Check for hallucination, verify citations, assess confidence
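The six stages above can be sketched as a chain of plain functions. This is a minimal illustration rather than a reference design: the `Chunk` type, the `run_pipeline` name, and every stage callable are hypothetical stand-ins that you would replace with your own retriever, re-ranker, and LLM client.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    source: str          # document name, used later for citations
    text: str
    score: float = 0.0   # relevance score, if the stage provides one


def run_pipeline(
    question: str,
    understand: Callable[[str], str],
    retrieve: Callable[[str], List[Chunk]],
    rerank: Callable[[str, List[Chunk]], List[Chunk]],
    assemble: Callable[[List[Chunk]], str],
    generate: Callable[[str, str], str],
    validate: Callable[[str, str], bool],
) -> str:
    """Run the six stages in order; every stage is injected (hypothetical)."""
    query = understand(question)           # 1. query understanding
    candidates = retrieve(query)           # 2. retrieval
    ranked = rerank(query, candidates)     # 3. re-ranking
    context = assemble(ranked)             # 4. context assembly
    answer = generate(question, context)   # 5. generation
    if not validate(answer, context):      # 6. validation
        return "I don't have enough information."
    return answer
```

Keeping each stage behind its own callable makes it easy to swap one piece (for example, a different re-ranker) and measure the effect in isolation.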

Vector search excels at semantic matching but can miss exact keywords (like product IDs or error codes). Keyword search (BM25) catches exact matches but misses paraphrases. Hybrid retrieval combines both, typically using Reciprocal Rank Fusion (RRF) to merge the two result lists. In practice this often outperforms either method alone, with gains in the range of 5-15%, though the exact figure depends on the dataset and the metric used.
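To make the merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion. Each document's fused score is the sum of 1 / (k + rank) over every list it appears in. The function name and the document IDs are illustrative, and k = 60 is a commonly used constant that is assumed here rather than taken from this lesson.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit in a list contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["d1", "d2", "d3"]    # hypothetical vector search results
keyword_hits = ["d2", "d4", "d1"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["d2", "d1", "d4", "d3"]
```

Because RRF only looks at ranks, it does not need the vector and keyword scores to be on the same scale, which is one reason it is a popular choice for hybrid retrieval.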

Initial retrieval is fast but imprecise. Re-ranking takes the top 20-50 retrieved chunks and scores each one against the query using a more powerful model (like a cross-encoder). This is slower but much more accurate. Tools like Cohere Rerank or an LLM-as-judge can do this.
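The re-ranking step itself is small once a scoring function exists. The sketch below assumes a pluggable `score_fn(query, chunk)`; in a real pipeline that function would wrap a cross-encoder or a hosted re-ranking service. The `token_overlap` scorer is only a toy stand-in so the example runs without any model, and both names are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    chunks: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the top_k best."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query tokens in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


candidates = [
    "billing cycles explained",
    "reset your password in settings",
    "how to reset a password",
]
top = rerank("reset password", candidates, token_overlap, top_k=2)
```

The cost of this stage grows with the number of candidates, since each one is scored individually, which is why it is applied to the top 20-50 chunks rather than the whole corpus.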

The generation prompt is the most impactful part of the pipeline. Key principles: (1) instruct the model to only use provided context, (2) require source citations, (3) explicitly allow "I don't know" responses, (4) format context clearly with source labels.

The #1 RAG failure mode is the model ignoring retrieved context and answering from its training data. Always include an explicit instruction: "Answer ONLY based on the provided context."
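An instruction alone does not guarantee grounding, so it can help to check the answer's citations afterwards. The sketch below is one simple validation step, assuming answers cite sources in the `[Source: document_name]` format used in the template in this lesson. The function name is hypothetical, and the check only confirms that cited documents were actually provided, not that the claims are supported by them.

```python
import re
from typing import List, Set


def unknown_citations(answer: str, provided_sources: Set[str]) -> List[str]:
    """Return cited document names that were not part of the provided context."""
    cited = re.findall(r"\[Source:\s*([^\]]+)\]", answer)
    return sorted({name.strip() for name in cited} - provided_sources)


answer = (
    "Refunds take 30 days [Source: policy.md]. "
    "Shipping is free [Source: blog.md]."
)
bad = unknown_citations(answer, {"policy.md", "faq.md"})
# bad == ["blog.md"]
```

A non-empty result is a signal that the model may have answered from its training data or invented a source, and the answer can be rejected or regenerated.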

Prompt Templates

RAG Generation with Validation

Production-grade RAG generation prompt with grounding, citation, and conflict handling.

You are a helpful assistant. Answer the question using ONLY the context below. Follow these rules:
- If the context doesn't contain the answer, say "I don't have enough information."
- Cite sources using [Source: document_name] after each claim
- If sources conflict, note the discrepancy

Context:
[RETRIEVED AND RE-RANKED CHUNKS WITH SOURCE LABELS]

Question: [QUESTION]
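The template above can be filled programmatically. This is a minimal sketch, assuming the re-ranked chunks arrive as (document_name, text) pairs; the function name and the exact way each chunk is labeled are illustrative choices, not a fixed format.

```python
from typing import List, Tuple

GENERATION_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the context below. Follow these rules:
- If the context doesn't contain the answer, say "I don't have enough information."
- Cite sources using [Source: document_name] after each claim
- If sources conflict, note the discrepancy

Context:
{context}

Question: {question}"""


def build_generation_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Fill the template; chunks are (document_name, text), best match first."""
    # Label every chunk with its source so the model can cite it by name
    context = "\n\n".join(f"[Source: {name}]\n{text}" for name, text in chunks)
    return GENERATION_TEMPLATE.format(context=context, question=question)


prompt = build_generation_prompt(
    "What is the refund window?",
    [("policy.md", "Refunds are accepted within 30 days."),
     ("faq.md", "Contact support to start a refund.")],
)
```

Using the same label format in the context and in the citation rule gives the model an exact string to copy, which makes citations easier to verify afterwards.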

LLM Re-Ranker

Uses an LLM as a re-ranker to improve retrieval precision before generation.

Given the query and the following passages, rate each passage's relevance from 0-10 and briefly explain why.

Query: [QUERY]

Passage 1: [CHUNK 1]
Passage 2: [CHUNK 2]
Passage 3: [CHUNK 3]

Return results sorted by relevance score (highest first).
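A short sketch of how this template might be used in code: one function fills in the query and passages, and another pulls the scores back out of the model's reply. The reply format parsed here (lines such as "Passage 2: 8 - reason") is an assumption, since the template does not pin down an output format; in practice you would either specify the format in the prompt or request structured output. Both function names are hypothetical.

```python
import re
from typing import List, Tuple

RERANK_TEMPLATE = (
    "Given the query and the following passages, rate each passage's "
    "relevance from 0-10 and briefly explain why.\n\n"
    "Query: {query}\n\n"
    "{passages}\n\n"
    "Return results sorted by relevance score (highest first)."
)


def build_rerank_prompt(query: str, chunks: List[str]) -> str:
    """Number the chunks from 1 and insert them into the template."""
    passages = "\n".join(f"Passage {i}: {c}" for i, c in enumerate(chunks, start=1))
    return RERANK_TEMPLATE.format(query=query, passages=passages)


def parse_scores(reply: str) -> List[Tuple[int, int]]:
    """Extract (passage_number, score) pairs, highest score first.

    Assumes reply lines look like 'Passage 2: 8 - reason' (an assumed format).
    """
    pairs = [(int(n), int(s)) for n, s in re.findall(r"Passage (\d+):\s*(\d+)", reply)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


rerank_prompt = build_rerank_prompt("reset password", ["Billing FAQ", "Password reset steps"])
ranked = parse_scores("Passage 2: 9 - directly relevant\nPassage 1: 2 - off topic")
# ranked == [(2, 9), (1, 2)]
```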

Key Takeaways

✓ Production RAG has six stages: query understanding, retrieval, re-ranking, context assembly, generation, and validation
✓ Hybrid retrieval (vector + keyword) outperforms either alone by 5-15%
✓ Always instruct the generation model to answer only from provided context and cite sources
