Building Eval Suites

Create test suites that measure prompt quality across diverse inputs and edge cases.

Without systematic evaluation, prompt engineering is guesswork. You change a prompt, test it on one or two examples, and hope it works. Eval suites let you test changes against dozens or hundreds of cases, catching regressions and measuring improvements quantitatively.

An eval suite is a collection of test cases, each with an input, the expected output (or criteria for a good output), and a scoring function. When you modify a prompt, you run the full suite and compare scores.
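As a minimal sketch of that structure, the Python below models a test case as an input, an expected output, and a scoring function, then runs two prompt versions over the same suite. The names `EvalCase`, `run_suite`, `prompt_v1`, and `prompt_v2` are illustrative assumptions, and the two prompt functions are stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected: str
    # Scoring function: (model output, expected) -> score between 0 and 1.
    scorer: Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_suite(prompt_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through prompt_fn and return the mean score."""
    if not cases:
        raise ValueError("eval suite is empty")
    scores = [case.scorer(prompt_fn(case.input), case.expected) for case in cases]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for two versions of a sentiment prompt.
# A real prompt_fn would send the prompt plus the input to a model.
def prompt_v1(text: str) -> str:
    return "positive"


def prompt_v2(text: str) -> str:
    return "negative" if "bad" in text else "positive"


cases = [
    EvalCase("great product", "positive", exact_match),
    EvalCase("bad service", "negative", exact_match),
]

baseline = run_suite(prompt_v1, cases)
candidate = run_suite(prompt_v2, cases)
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
```

Comparing the two numbers on the same cases is what turns a prompt change into a measurable improvement or regression.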

A balanced suite draws on four categories of test case:

  1. Happy path: Typical inputs that should work well (40% of cases)
  2. Edge cases: Unusual inputs, boundary conditions, empty inputs (30% of cases)
  3. Adversarial inputs: Attempts to break the prompt, such as injection payloads (15% of cases)
  4. Regression cases: Specific inputs that previously failed, kept so they stay fixed (15% of cases)
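To turn the 40/30/15/15 split into concrete counts for a suite of a given size, a small helper like the one below may be useful. This is a sketch: the function name `allocate` and the category keys are assumptions, and leftover cases from rounding are handed to the categories with the largest remainders.

```python
# Category shares in whole percent, taken from the list above.
SHARES_PCT = {"happy_path": 40, "edge_case": 30, "adversarial": 15, "regression": 15}


def allocate(total: int) -> dict[str, int]:
    """Split `total` cases across categories using integer arithmetic."""
    counts, remainders = {}, {}
    for name, pct in SHARES_PCT.items():
        counts[name], remainders[name] = divmod(total * pct, 100)
    # Cases lost to rounding down go to the largest remainders first.
    leftover = total - sum(counts.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


print(allocate(20))
```

For a 20-case suite this yields 8 happy path, 6 edge, 3 adversarial, and 3 regression cases.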

Common scoring methods:

  • Exact match: Output must exactly match the expected answer (good for classification, extraction)
  • Contains/regex: Output must contain specific strings or match patterns
  • LLM-as-judge: Use a model to rate the output on criteria (best for open-ended tasks)
  • Human evaluation: Manually rate a sample of outputs (gold standard but slow)
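The two automated string-based methods can be sketched in a few lines of Python. The function names are illustrative, and the normalization choices (trimming whitespace, ignoring case) are assumptions that may not suit every task.

```python
import re


def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the expected answer, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_all(output: str, required: list[str]) -> float:
    """Fraction of the required substrings that appear in the output (case-insensitive)."""
    if not required:
        return 1.0
    lowered = output.lower()
    hits = sum(1 for item in required if item.lower() in lowered)
    return hits / len(required)


def matches_pattern(output: str, pattern: str) -> float:
    """1.0 if the regular expression matches anywhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0


print(exact_match(" Positive\n", "positive"))
print(contains_all("Refund issued within 5 days", ["refund", "days", "email"]))
print(matches_pattern("Order #12345 shipped", r"#\d{5}\b"))
```

Returning a float rather than a boolean lets partial credit (as in `contains_all`) average cleanly with pass/fail scores across the suite.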

Several tools streamline eval creation and execution: Promptfoo (open-source CLI), Braintrust, LangSmith, and custom scripts. At minimum, you need a spreadsheet of test cases and a script that runs them against your prompt and scores results.
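The "spreadsheet plus script" minimum might look like the sketch below, which reads CSV rows and reports an exact-match pass rate per category. The column names (`category`, `input`, `expected`), the inline `SHEET` data, and `fake_model` are all assumptions for illustration; in practice the CSV would be exported from the spreadsheet and the model function would call the prompt under test.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for an exported spreadsheet of test cases.
SHEET = """category,input,expected
happy_path,2+2,4
happy_path,10-3,7
edge_case,,ERROR
"""


def fake_model(text: str) -> str:
    """Placeholder for a real model call made with the prompt under test."""
    answers = {"2+2": "4", "10-3": "6", "": "ERROR"}
    return answers.get(text, "")


def score_sheet(sheet_text: str, model) -> dict[str, float]:
    """Return the exact-match pass rate for each category in the sheet."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        total[row["category"]] += 1
        if model(row["input"]).strip() == row["expected"].strip():
            passed[row["category"]] += 1
    return {category: passed[category] / total[category] for category in total}


results = score_sheet(SHEET, fake_model)
for category, rate in results.items():
    print(f"{category}: {rate:.0%}")
```

Breaking the pass rate out by category shows whether a prompt change helped typical inputs at the expense of edge cases, which a single overall number can hide.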

Start with 20-30 test cases — enough to catch major issues without being overwhelming. Grow the suite organically by adding every failure you encounter in production as a new test case.

Prompt Templates

Eval Test Case Generator

Generates a balanced set of eval test cases for any prompt.

I have a prompt that [PROMPT PURPOSE]. Generate 20 eval test cases:

- 8 happy path (typical inputs with expected behavior)
- 6 edge cases (unusual inputs, boundary conditions)
- 3 adversarial (injection attempts, constraint violations)
- 3 regression-prone (ambiguous inputs likely to cause inconsistent behavior)

For each: input, expected output/criteria, and why this case matters.

LLM-as-Judge Evaluator

Structured LLM-as-judge prompt for consistent evaluation of AI outputs.

You are an expert evaluator. Rate this AI response on a scale of 1-5 for each criterion:

Task: [ORIGINAL TASK]
AI Response: [RESPONSE]

Criteria:
- Accuracy (1-5): Are all facts and claims correct?
- Completeness (1-5): Does it fully address the task?
- Relevance (1-5): Does it stay on topic without unnecessary content?
- Clarity (1-5): Is it well-organized and easy to understand?

Provide a score and a one-line justification for each criterion. Final score = the mean of the four criterion scores.
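When the judge's reply is consumed by a script, it can be safer to parse the per-criterion scores and compute the average in code rather than trust the model's arithmetic. The sketch below assumes the judge writes each line as `Criterion: N - justification`; that output format, the sample reply, and the function names are assumptions, not something the template guarantees.

```python
import re

CRITERIA = ("Accuracy", "Completeness", "Relevance", "Clarity")


def parse_judge_scores(reply: str) -> dict[str, int]:
    """Extract one 1-5 score per criterion from a judge reply."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*:\s*([1-5])\b", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"missing score for {name}")
        scores[name] = int(match.group(1))
    return scores


def final_score(scores: dict[str, int]) -> float:
    """Average the criterion scores, as the template specifies."""
    return sum(scores.values()) / len(scores)


# Hypothetical judge reply used only to exercise the parser.
sample_reply = """Accuracy: 4 - one date is slightly off
Completeness: 5 - covers every part of the task
Relevance: 4 - brief tangent in the last paragraph
Clarity: 3 - long unbroken paragraphs
"""

scores = parse_judge_scores(sample_reply)
print(scores, final_score(scores))
```

Raising an error on a missing criterion makes malformed judge replies visible instead of silently skewing the average.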

Key Takeaways

  • Eval suites replace guesswork with quantitative measurement — test every prompt change against diverse cases
  • Balance test cases: 40% happy path, 30% edge cases, 15% adversarial, 15% regression
  • Start with 20-30 cases and grow organically by adding every production failure as a new test