Retrieval & Answer Synthesis

Complete the RAG pipeline: retrieve relevant chunks, craft synthesis prompts, and generate grounded answers with citations.

Phase 2: Retrieval & Synthesis

With your documents chunked, embedded, and stored, the second half of the RAG pipeline handles query-time logic: embed the user question, retrieve the top-k most similar chunks, and synthesize an answer with a carefully designed prompt. The synthesis prompt is where prompt engineering meets information retrieval, and it is often where a RAG pipeline succeeds or fails.

Step 4: Query & Retrieve

from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_collection("documents")

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Embed query and retrieve the top-k most relevant chunks."""
    # Embed the query with the same model used for documents
    response = client.embeddings.create(
        input=[query],
        model="text-embedding-3-small",
    )
    query_embedding = response.data[0].embedding
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
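        chunks.append({
            "text": doc,
            "source": meta["source"],
            "chunk_index": meta["chunk_index"],
            # 1 - distance is a similarity only when the collection uses cosine
            # distance (e.g. created with metadata={"hnsw:space": "cosine"});
            # Chroma's default space is squared L2.
            "similarity": 1 - dist,
        })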
    
    return chunks


# Example
results = retrieve("What is the company's remote work policy?")
for r in results:
    print(f"[sim={r['similarity']:.3f}] {r['text'][:100]}...")
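Always embed your query with the same model used for the document embeddings. Vectors from different models live in different spaces, often with different dimensions, so similarity scores between them are meaningless and retrieval either fails or returns unrelated chunks.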
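The `1 - dist` conversion in `retrieve` depends on which distance the collection reports. The sketch below is a plain-Python illustration, not part of the original pipeline: the helper names are hypothetical, and the L2 branch assumes unit-length embeddings and a squared-L2 distance.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, defined as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2_distance(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_from_distance(dist: float, space: str) -> float:
    """Convert a reported distance back to cosine similarity.

    Assumption: for space == "l2" the vectors are unit length, in which
    case squared L2 = 2 * (1 - cos), so cos = 1 - dist / 2.
    """
    if space == "cosine":
        return 1.0 - dist
    if space == "l2":
        return 1.0 - dist / 2.0
    raise ValueError(f"Unsupported space: {space}")


# Two unit-length vectors: both routes recover the same similarity
a, b = [1.0, 0.0], [0.6, 0.8]
sim_cos = similarity_from_distance(cosine_distance(a, b), "cosine")
sim_l2 = similarity_from_distance(squared_l2_distance(a, b), "l2")
print(f"cosine route: {sim_cos:.3f}, l2 route: {sim_l2:.3f}")
```

If the similarity values printed by `retrieve` look odd (negative, or far from the 0 to 1 range), the distance space of the collection is the first thing to check.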

Step 5: Answer Synthesis

The synthesis prompt is the most critical piece of prompt engineering in a RAG system. It must instruct the model to: (1) base its answer only on the provided context, (2) cite which chunks it used, and (3) say "I don't know" when the context does not contain the answer. Without these guardrails, the model tends to fill gaps from its training data and present the result just as confidently as a grounded answer.

RAG Synthesis Prompt

The core synthesis prompt that grounds LLM answers in retrieved context and forces citation.

You are a knowledgeable assistant that answers questions using ONLY the provided context. If the context does not contain enough information to answer the question fully, say "I don't have enough information to answer that" and explain what is missing.

Context (retrieved documents):
{{context_chunks}}

User question: {{user_question}}

Instructions:
1. Answer the question using ONLY information from the context above.
2. After your answer, add a "Sources" section listing the chunk index numbers you used (e.g., [Chunk 2], [Chunk 4]).
3. If two chunks provide conflicting information, acknowledge the discrepancy and present both.
4. Do NOT use any knowledge beyond what is in the context.
5. Keep your answer concise — aim for 2-4 paragraphs maximum.
Best with: GPT-4o / Claude
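The `{{context_chunks}}` and `{{user_question}}` placeholders in the template above have to be filled in before the prompt is sent. One possible way to do that is sketched below; `render_prompt` is a hypothetical helper, not something the lesson defines. It uses a regular expression rather than `str.format` so that literal braces in a prompt (such as a JSON example) are left alone.

```python
import re


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Replace {{name}} placeholders with values; fail loudly if one is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value supplied for placeholder '{name}'")
        return values[name]

    # Matches {{name}} with optional spaces inside the braces
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "Context (retrieved documents):\n{{context_chunks}}\n\n"
    "User question: {{user_question}}"
)
prompt = render_prompt(
    template,
    {
        "context_chunks": "[Chunk 2] (placeholder chunk text)",
        "user_question": "What is the company's remote work policy?",
    },
)
print(prompt)
```

Raising on a missing placeholder is a deliberate choice here: a prompt that silently ships with `{{context_chunks}}` still in it gives the model no context at all.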
def answer_question(question: str, top_k: int = 5) -> str:
    """Full RAG pipeline: retrieve context, synthesize answer."""
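    chunks = retrieve(question, top_k=top_k)
    if not chunks:
        # Nothing retrieved: skip the LLM call rather than let it answer from memory
        return "I don't have enough information to answer that."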
    
    # Format context with chunk indices for citation
    context_str = "\n\n".join(
        f"[Chunk {c['chunk_index']}] (source: {c['source']}, similarity: {c['similarity']:.3f})\n{c['text']}"
        for c in chunks
    )
    
    system_prompt = (
        "You are a knowledgeable assistant that answers questions using ONLY "
        "the provided context. If the context does not contain enough information, "
        "say so. Always cite chunk numbers."
    )
    
    user_prompt = f"""Context (retrieved documents):
{context_str}

Question: {question}

Answer using only the context above. Cite sources as [Chunk N]."""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,  # low temperature: more consistent output that stays closer to the context
    )
    
    return response.choices[0].message.content


# Try it
answer = answer_question("What is the company's remote work policy?")
print(answer)
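Asking for citations does not guarantee the model cites chunks that were actually in the context. A lightweight post-check can catch the obvious failures. The sketch below uses a hypothetical helper and assumes the model follows the `[Chunk N]` format exactly; a citation written as `[Chunk 2, 4]` would not be matched.

```python
import re


def check_citations(answer: str, retrieved_indices: set[int]) -> dict:
    """Compare [Chunk N] citations in an answer with the chunks that were supplied."""
    cited = {int(n) for n in re.findall(r"\[Chunk (\d+)\]", answer)}
    return {
        "cited": sorted(cited),
        # Cited indices that were never part of the context
        "unknown": sorted(cited - retrieved_indices),
        "has_citations": bool(cited),
    }


# Illustrative answer text, not real model output
sample_answer = "Remote work is allowed [Chunk 2].\n\nSources: [Chunk 2], [Chunk 7]"
report = check_citations(sample_answer, retrieved_indices={1, 2, 3})
print(report)
```

Because `chunk_index` comes from each document's metadata, two different sources can share the same index. If that applies to your corpus, a citation label that combines source and index is less ambiguous.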

Step 6: Re-Ranking for Better Precision

Vector similarity retrieval is fast but imprecise: embeddings capture topical similarity, so a chunk can score highly without actually answering the question. A re-ranking step uses an LLM to score each retrieved chunk's relevance to the question, then keeps only the best ones. This lowers the risk of hallucination caused by irrelevant chunks sneaking into the context.

Chunk Relevance Re-Ranker

Scores each retrieved chunk's relevance so you can filter out noise before synthesis.

You are a relevance scoring engine. Given a user question and a document chunk, rate how relevant the chunk is for answering the question.

Question: {{user_question}}

Chunk:
{{chunk_text}}

Rate the relevance from 0 to 10:
- 0 = completely irrelevant
- 5 = somewhat related but does not directly answer the question
- 10 = directly and completely answers the question

Respond with ONLY a JSON object: {"score": <number>, "reason": "<one sentence>"}
Best with: GPT-4o-mini / Claude Haiku
import json

def rerank_chunks(question: str, chunks: list[dict], threshold: float = 5.0) -> list[dict]:
    """Use the LLM to re-score chunk relevance and filter low-quality results."""
    scored = []
    
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # fast & cheap for scoring
            messages=[{
                "role": "user",
                "content": f"Rate relevance 0-10.\n\nQuestion: {question}\n\nChunk:\n{chunk['text']}\n\nRespond with ONLY JSON: {{\"score\": <number>, \"reason\": \"<one sentence>\"}}",
            }],
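            temperature=0,
            response_format={"type": "json_object"},  # ask the API for syntactically valid JSON
        )
        
        try:
            result = json.loads(response.choices[0].message.content)
            chunk["relevance_score"] = float(result["score"])  # tolerate "7" as well as 7
            chunk["relevance_reason"] = str(result["reason"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Malformed output or missing fields: treat the chunk as irrelevant
            chunk["relevance_score"] = 0.0
            chunk["relevance_reason"] = "Failed to parse"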
        
        scored.append(chunk)
    
    # Filter and sort by relevance
    filtered = [c for c in scored if c["relevance_score"] >= threshold]
    filtered.sort(key=lambda c: c["relevance_score"], reverse=True)
    
    print(f"Re-ranked: {len(filtered)}/{len(chunks)} chunks above threshold {threshold}")
    return filtered
Use a cheap, fast model (GPT-4o-mini or Claude Haiku) for re-ranking since you make one call per chunk. Reserve the powerful model for the final synthesis step where quality matters most.
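Smaller models sometimes wrap their JSON in a code fence, add a sentence around it, or return a score outside the requested range. A more defensive parser for the scorer's reply might look like the sketch below; `parse_relevance` is a hypothetical helper, and the fallback of 0 mirrors the behaviour of `rerank_chunks` above.

```python
import json
import re
from typing import Optional

FAILED = (0.0, "Failed to parse")


def parse_relevance(raw: Optional[str]) -> tuple[float, str]:
    """Extract (score, reason) from a scorer reply, clamping score to 0-10."""
    if not raw:
        return FAILED
    # Take the span from the first "{" to the last "}" so that code fences
    # or stray prose around the JSON object are ignored
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return FAILED
    try:
        data = json.loads(match.group(0))
        score = float(data["score"])
        reason = str(data.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError):
        return FAILED
    return min(max(score, 0.0), 10.0), reason


fence = "`" * 3  # built at runtime to avoid nesting a code fence in this example
print(parse_relevance(f'{fence}json\n{{"score": 8, "reason": "On topic."}}\n{fence}'))
print(parse_relevance("The chunk is not relevant."))
```

Treating an unparseable reply as a score of 0 silently drops the chunk. If recall matters more than precision for your use case, falling back to the threshold value instead is a reasonable alternative.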

Putting the Full Pipeline Together

  1. Load and chunk documents (Phase 1)
  2. Generate embeddings and store in ChromaDB (Phase 1)
  3. Embed the user query with the same model
  4. Retrieve top-k chunks by cosine similarity
  5. Re-rank chunks with an LLM relevance scorer
  6. Build the context string with chunk IDs
  7. Send context + question to the synthesis prompt
  8. Return the grounded answer with citations
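The query-time steps (3 to 8) can be wired together roughly as follows. This is a sketch under stated assumptions rather than the lesson's own implementation: `run_pipeline` and its parameters are hypothetical, and each stage is passed in as a callable so the flow can be exercised with stubs before any API calls are involved. In practice you would pass `retrieve`, `rerank_chunks`, and a function wrapping the chat completion call.

```python
from typing import Callable

Chunk = dict  # expects "text", "source", and "chunk_index" keys


def run_pipeline(
    question: str,
    retrieve_fn: Callable[[str, int], list[Chunk]],
    rerank_fn: Callable[[str, list[Chunk]], list[Chunk]],
    synthesize_fn: Callable[[str, str], str],
    top_k: int = 5,
    no_answer: str = "I don't have enough information to answer that.",
) -> str:
    chunks = retrieve_fn(question, top_k)   # steps 3-4: embed query, retrieve
    kept = rerank_fn(question, chunks)      # step 5: filter by relevance
    if not kept:
        return no_answer                    # nothing relevant: skip the LLM call
    context = "\n\n".join(                  # step 6: context with chunk IDs
        f"[Chunk {c['chunk_index']}] (source: {c['source']})\n{c['text']}"
        for c in kept
    )
    return synthesize_fn(context, question)  # steps 7-8: grounded answer


# Stub stages with made-up data, standing in for the real functions
def fake_retrieve(question: str, top_k: int) -> list[Chunk]:
    return [
        {"text": "Remote work: placeholder policy text.", "source": "a.md", "chunk_index": 3},
        {"text": "Unrelated placeholder text.", "source": "b.md", "chunk_index": 9},
    ][:top_k]


def fake_rerank(question: str, chunks: list[Chunk]) -> list[Chunk]:
    return [c for c in chunks if "remote" in c["text"].lower()]


def fake_synthesize(context: str, question: str) -> str:
    return f"(stub answer) context used:\n{context}"


print(run_pipeline("What is the remote work policy?", fake_retrieve, fake_rerank, fake_synthesize))
```

Returning early when re-ranking leaves no chunks keeps the "I don't know" behaviour in code, so it does not depend on the model following the prompt.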

RAG Quality Evaluation Prompt

Evaluates RAG pipeline output quality across four dimensions for iterative improvement.

Evaluate the following RAG system response for quality.

Original question: {{question}}
Retrieved context: {{context}}
Generated answer: {{answer}}

Score each dimension from 1-5:
1. **Faithfulness** — Does the answer only contain claims supported by the context?
2. **Relevance** — Does the answer address the actual question asked?
3. **Completeness** — Does the answer use all relevant information from the context?
4. **Citation accuracy** — Are the cited chunk numbers correct?

For each dimension, provide the score and a one-sentence justification.
Finally, give an overall score (average) and one specific suggestion for improvement.
Best with: GPT-4o / Claude
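The evaluation prompt asks the model for an overall average, but it is safer to compute that number yourself from the four dimension scores. A minimal sketch, assuming you have already extracted the scores into a dictionary (the key names and the `overall_score` helper are hypothetical):

```python
DIMENSIONS = ("faithfulness", "relevance", "completeness", "citation_accuracy")


def overall_score(scores: dict[str, int]) -> float:
    """Validate the four 1-5 dimension scores and return their mean."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"Missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} must be between 1 and 5, got {scores[d]}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Illustrative scores, not real evaluation results
example = {"faithfulness": 5, "relevance": 4, "completeness": 3, "citation_accuracy": 4}
print(overall_score(example))
```

Tracking the per-dimension scores across prompt or chunking changes is usually more informative than the average alone, since a drop in faithfulness can hide behind a gain in completeness.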

Test Your Knowledge

Why should the synthesis prompt tell the model to say "I don't know" when context is insufficient?

Key Takeaways

  • Always embed queries with the same model used for document embeddings.
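  • The synthesis prompt must restrict the model to the provided context, require citations, and allow "I don't know" as an answer.
  • Re-ranking with a cheap LLM filters out off-topic chunks before synthesis, improving precision at the cost of one extra call per chunk.
  • Use a low temperature for synthesis to keep answers consistent and close to the context.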
  • Evaluate RAG quality on faithfulness, relevance, completeness, and citation accuracy.