What is the purpose of the re-ranking step?

Vector similarity search is fast but can return chunks that are topically related but do not actually answer the question. Re-ranking uses an LLM to filter these out, improving answer precision.

Why use a low temperature (e.g., 0.2) for the final synthesis call?

In RAG, you want the model to faithfully report what the context says, not creatively rephrase or invent details. Low temperature reduces randomness and keeps the output close to the most likely (and usually most accurate) tokens.

Module 6, Lesson 2

Retrieval & Answer Synthesis

Complete the RAG pipeline: retrieve relevant chunks, craft synthesis prompts, and generate grounded answers with citations.

15 min read

3 quiz questions

Phase 2: Retrieval & Synthesis

With your documents chunked, embedded, and stored, the second half of the RAG pipeline handles query-time logic: embed the user question, retrieve the top-k most similar chunks, and synthesize an answer with a carefully designed prompt. The synthesis prompt is where prompt engineering meets information retrieval — and it is where most RAG pipelines succeed or fail.

Step 4: Query & Retrieve

from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_collection("documents")

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Embed query and retrieve the top-k most relevant chunks."""
    # Embed the query with the same model used for documents
    response = client.embeddings.create(
        input=[query],
        model="text-embedding-3-small",
    )
    query_embedding = response.data[0].embedding
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        chunks.append({
            "text": doc,
            "source": meta["source"],
            "chunk_index": meta["chunk_index"],
            "similarity": 1 - dist,  # cosine distance → similarity (only valid if the collection uses cosine distance)
        })
    
    return chunks


# Example
results = retrieve("What is the company's remote work policy?")
for r in results:
    print(f"[sim={r['similarity']:.3f}] {r['text'][:100]}...")

Always embed your query with the SAME model used for document embeddings. Mixing models produces vectors in different spaces and retrieval will return garbage.
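To make the `1 - dist` conversion and the same-model rule concrete, here is a minimal pure-Python sketch of cosine similarity. The helper names `cosine_similarity` and `cosine_distance` are illustrative and not part of ChromaDB or the OpenAI SDK, and the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        # Embeddings from different models often differ in dimension.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form of the same measure, so similarity = 1 - distance."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors (made up for illustration)
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

A dimension check like this only catches some mismatches. Two models with the same output size still produce incompatible vector spaces, so recording the embedding model name alongside the collection is the safer guard.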

Step 5: Answer Synthesis

The synthesis prompt is the most critical piece of prompt engineering in a RAG system. It must instruct the model to: (1) base its answer only on the provided context, (2) cite which chunks it used, and (3) say "I don't know" when the context does not contain the answer. Without these guardrails the model will hallucinate confidently.

RAG Synthesis Prompt

The core synthesis prompt that grounds LLM answers in retrieved context and forces citation.

You are a knowledgeable assistant that answers questions using ONLY the provided context. If the context does not contain enough information to answer the question fully, say "I don't have enough information to answer that" and explain what is missing.

Context (retrieved documents):
{{context_chunks}}

User question: {{user_question}}

Instructions:
1. Answer the question using ONLY information from the context above.
2. After your answer, add a "Sources" section listing the chunk index numbers you used (e.g., [Chunk 2], [Chunk 4]).
3. If two chunks provide conflicting information, acknowledge the discrepancy and present both.
4. Do NOT use any knowledge beyond what is in the context.
5. Keep your answer concise — aim for 2-4 paragraphs maximum.

Best with: OpenAI / Claude / Gemini
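One way to fill the `{{context_chunks}}` and `{{user_question}}` placeholders is a small substitution helper. The sketch below is an assumption about how you might wire the template up: `render_prompt` and `SYNTHESIS_TEMPLATE` are hypothetical names, the template is abbreviated, and the sample chunk text is invented for the example.

```python
import re

# Abbreviated version of the synthesis prompt above
SYNTHESIS_TEMPLATE = """Context (retrieved documents):
{{context_chunks}}

User question: {{user_question}}"""

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with values[name]."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Fail loudly instead of sending a half-filled prompt to the model.
            raise KeyError(f"no value supplied for placeholder '{name}'")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


prompt = render_prompt(
    SYNTHESIS_TEMPLATE,
    context_chunks="[Chunk 0] Employees may work remotely up to three days per week.",
    user_question="What is the remote work policy?",
)
print(prompt)
```

Because the substitution is a single pass, braces that happen to appear inside a retrieved chunk are inserted as-is rather than being treated as further placeholders.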

def answer_question(question: str, top_k: int = 5) -> str:
    """Full RAG pipeline: retrieve context, synthesize answer."""
    chunks = retrieve(question, top_k=top_k)

    # Format context with chunk indices for citation
    context_str = "\n\n".join(
        f"[Chunk {c['chunk_index']}] (source: {c['source']}, similarity: {c['similarity']:.3f})\n{c['text']}"
        for c in chunks
    )

    instructions = (
        "You are a knowledgeable assistant that answers questions using ONLY "
        "the provided context. If the context does not contain enough information, "
        "say so. Always cite chunk numbers."
    )

    user_prompt = f"""Context (retrieved documents):
{context_str}

Question: {question}

Answer using only the context above. Cite sources as [Chunk N]."""

    response = client.responses.create(
        model="gpt-5.5",
        reasoning={"effort": "low"},
        instructions=instructions,
        input=user_prompt,
    )

    return response.output_text


# Try it
answer = answer_question("What is the company's remote work policy?")
print(answer)
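The prompt asks for `[Chunk N]` citations, but nothing in `answer_question` verifies them. A possible post-check, sketched below, extracts the cited numbers and compares them with the chunks that were actually retrieved. `extract_citations` and `check_citations` are hypothetical helpers, and the sketch assumes `chunk_index` values are integers that are unique within one retrieval result.

```python
import re

CITATION = re.compile(r"\[Chunk (\d+)\]")


def extract_citations(answer: str) -> set[int]:
    """Collect every chunk number cited as [Chunk N] in the answer."""
    return {int(n) for n in CITATION.findall(answer)}


def check_citations(answer: str, chunks: list[dict]) -> dict:
    """Compare cited chunk numbers against the chunks that were retrieved."""
    cited = extract_citations(answer)
    available = {c["chunk_index"] for c in chunks}
    return {
        "cited": sorted(cited),
        "unknown": sorted(cited - available),  # cited but never retrieved
        "has_citations": bool(cited),
    }


# Made-up data for illustration
chunks = [{"chunk_index": 2}, {"chunk_index": 4}]
answer = "Remote work is allowed three days a week [Chunk 2]. See also [Chunk 7]."

report = check_citations(answer, chunks)
print(report)
# {'cited': [2, 7], 'unknown': [7], 'has_citations': True}
```

A non-empty `unknown` list or a missing citation is a cheap signal that the answer may not be grounded and is worth logging or retrying. If chunk indices can repeat across source files, a unique chunk ID would be needed instead.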

Step 6: Re-Ranking for Better Precision

Vector similarity retrieval is fast but imprecise: it can return chunks that are topically related without actually answering the question. A re-ranking step uses the LLM itself to score each retrieved chunk's relevance to the question, then keeps only the best ones. This reduces hallucinations caused by irrelevant chunks sneaking into the context.

Chunk Relevance Re-Ranker

Scores each retrieved chunk's relevance so you can filter out noise before synthesis.

You are a relevance scoring engine. Given a user question and a document chunk, rate how relevant the chunk is for answering the question.

Question: {{user_question}}

Chunk:
{{chunk_text}}

Rate the relevance from 0 to 10:
- 0 = completely irrelevant
- 5 = somewhat related but does not directly answer the question
- 10 = directly and completely answers the question

Respond with ONLY a JSON object: {"score": <number>, "reason": "<one sentence>"}

Best with: Fast model tier (OpenAI / Claude / Gemini)

import json

def rerank_chunks(question: str, chunks: list[dict], threshold: float = 5.0) -> list[dict]:
    """Use a fast model tier to re-score chunk relevance and filter low-quality results."""
    scored = []

    for chunk in chunks:
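        # Triple-quoted f-string so the prompt can span several lines;
        # doubled braces {{ }} produce literal braces in the JSON example.
        prompt = f"""Rate relevance 0-10.

Question: {question}

Chunk:
{chunk['text']}

Respond with ONLY JSON: {{"score": <number>, "reason": "<one sentence>"}}"""

        response = client.responses.create(
            model="gpt-5.4-mini",
            instructions="You are a relevance scoring engine. Return valid JSON only.",
            input=prompt,
        )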

        try:
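            result = json.loads(response.output_text)
            # Coerce to float so a string score like "7" still compares against the threshold
            chunk["relevance_score"] = float(result["score"])
            chunk["relevance_reason"] = result["reason"]
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):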
            chunk["relevance_score"] = 0
            chunk["relevance_reason"] = "Failed to parse"

        scored.append(chunk)

    # Filter and sort by relevance
    filtered = [c for c in scored if c["relevance_score"] >= threshold]
    filtered.sort(key=lambda c: c["relevance_score"], reverse=True)

    print(f"Re-ranked: {len(filtered)}/{len(chunks)} chunks above threshold {threshold}")
    return filtered

Use a cheap, fast model tier for re-ranking since you make one call per chunk. Reserve the more capable model for the final synthesis step where quality matters most.
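Fast models do not always return bare JSON; replies are sometimes wrapped in a code fence or carry the score as a string. The sketch below shows one tolerant way to parse such replies. `parse_relevance` is a hypothetical helper, and the sample replies are invented for the example rather than captured from a real model.

```python
import json
import re


def parse_relevance(raw: str) -> tuple[float, str]:
    """Extract (score, reason) from a model reply; return (0.0, ...) on failure."""
    # Grab everything from the first '{' to the last '}' so surrounding
    # text or a Markdown code fence does not break parsing.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return 0.0, "No JSON object found"
    try:
        data = json.loads(match.group(0))
        score = float(data["score"])
        reason = str(data.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0, "Failed to parse"
    # Clamp to the 0-10 scale the prompt asks for.
    return max(0.0, min(10.0, score)), reason


fenced_reply = '```json\n{"score": 8, "reason": "Directly states the policy."}\n```'
print(parse_relevance(fenced_reply))
print(parse_relevance("Sorry, I cannot rate this."))
```

Scoring unparseable replies as 0 drops those chunks, which matches the behaviour of `rerank_chunks`. Whether dropping or keeping a chunk is the safer default on a parse failure depends on how costly a missed chunk is in your application.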

Putting the Full Pipeline Together

1. Load and chunk documents (Phase 1)
2. Generate embeddings and store in ChromaDB (Phase 1)
3. Embed the user query with the same model
4. Retrieve top-k chunks by cosine similarity
5. Re-rank chunks with an LLM relevance scorer
6. Build the context string with chunk IDs
7. Send context + question to the synthesis prompt
8. Return the grounded answer with citations
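The query-time steps above can be sketched as a single function. Since `answer_question` retrieves internally and has no slot for re-ranked chunks, this version takes the three stages as callables. `run_pipeline` and the `fake_*` stand-ins are hypothetical and exist only so the example runs without API calls; in practice you would pass `retrieve`, `rerank_chunks`, and a function that wraps the synthesis call.

```python
from typing import Callable

Chunk = dict


def run_pipeline(
    question: str,
    retrieve: Callable[[str], list[Chunk]],
    rerank: Callable[[str, list[Chunk]], list[Chunk]],
    synthesize: Callable[[str, str], str],
    no_context_reply: str = "I don't have enough information to answer that.",
) -> str:
    """Retrieve, re-rank, build the context string, then synthesize."""
    candidates = retrieve(question)           # embed the query + vector search
    relevant = rerank(question, candidates)   # LLM relevance filter
    if not relevant:
        # Nothing survived re-ranking: skip the synthesis call entirely.
        return no_context_reply
    context = "\n\n".join(
        f"[Chunk {c['chunk_index']}] {c['text']}" for c in relevant
    )
    return synthesize(question, context)


# Stand-ins so the sketch runs offline (made-up data)
def fake_retrieve(question: str) -> list[Chunk]:
    return [
        {"chunk_index": 0, "text": "Policy text", "relevance_score": 9},
        {"chunk_index": 1, "text": "Cafeteria menu", "relevance_score": 1},
    ]


def fake_rerank(question: str, chunks: list[Chunk]) -> list[Chunk]:
    return [c for c in chunks if c["relevance_score"] >= 5]


def fake_synthesize(question: str, context: str) -> str:
    return f"ANSWER based on:\n{context}"


print(run_pipeline("What is the policy?", fake_retrieve, fake_rerank, fake_synthesize))
```

Returning early when re-ranking leaves no chunks avoids asking the model to answer from an empty context, which is the situation the "I don't know" instruction is meant to cover.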

RAG Quality Evaluation Prompt

Evaluates RAG pipeline output quality across four dimensions for iterative improvement.

Evaluate the following RAG system response for quality.

Original question: {{question}}
Retrieved context: {{context}}
Generated answer: {{answer}}

Score each dimension from 1-5:
1. **Faithfulness** — Does the answer only contain claims supported by the context?
2. **Relevance** — Does the answer address the actual question asked?
3. **Completeness** — Does the answer use all relevant information from the context?
4. **Citation accuracy** — Are the cited chunk numbers correct?

For each dimension, provide the score and a one-sentence justification.
Finally, give an overall score (average) and one specific suggestion for improvement.

Best with: OpenAI / Claude / Gemini
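Once the evaluator's reply has been parsed, the overall score is simply the mean of the four dimension scores. The sketch below assumes the scores already sit in a dict; the key names and the `overall_score` helper are hypothetical, and the example scores are made up.

```python
DIMENSIONS = ("faithfulness", "relevance", "completeness", "citation_accuracy")


def overall_score(scores: dict[str, int]) -> float:
    """Average the four dimension scores after validating the 1-5 range."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} must be between 1 and 5, got {scores[d]}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


example = {
    "faithfulness": 5,
    "relevance": 4,
    "completeness": 3,
    "citation_accuracy": 4,
}
print(overall_score(example))  # 4.0
```

Computing the average in code rather than trusting the model's own arithmetic keeps the overall score consistent with the per-dimension scores across runs.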

Test Your Knowledge

Why should the synthesis prompt tell the model to say "I don't know" when context is insufficient?

Key Takeaways

✓ Always embed queries with the same model used for document embeddings.
✓ The synthesis prompt must explicitly forbid hallucination and require citations.
✓ Re-ranking with a cheap LLM filters out off-topic chunks and improves retrieval precision.
✓ Use low temperature for synthesis to maximize factual accuracy.
✓ Evaluate RAG quality on faithfulness, relevance, completeness, and citation accuracy.
