Scoring Rubrics & Metrics
Design rubrics and select metrics that reliably measure what matters for your use case.
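The metric you choose determines what you optimize for. Common metrics include accuracy (for classification and factual tasks with a single correct answer), BLEU/ROUGE (for n-gram overlap with a reference text), F1 score (for extraction), and custom rubric scores (for open-ended tasks). Choosing the wrong metric means optimizing the wrong thing: a summary can score well on ROUGE while misstating a fact, because ROUGE measures word overlap with the reference, not correctness.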
- Classification: Accuracy, precision, recall, F1 score
- Extraction: Exact match, F1 over extracted entities, character-level overlap
- Summarization: ROUGE scores, factual consistency (via LLM judge), compression ratio
- Generation: LLM-as-judge scores, human preference ratings, task-specific rubrics
- RAG: Answer correctness, faithfulness (only uses provided context), relevance of retrieved chunks
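To make the extraction metrics above concrete, here is a minimal sketch of exact match and of precision, recall, and F1 over extracted entities. The function names `entity_prf` and `exact_match` are illustrative, and the sketch assumes entities are compared as exact strings after no normalization beyond what is shown; a real suite would likely normalize casing, whitespace, or aliases to suit its data.

```python
def entity_prf(predicted, gold):
    """Return (precision, recall, F1) over two collections of extracted entities.

    Entities are compared as exact strings; duplicates are ignored.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    # F1: harmonic mean of precision and recall (0.0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def exact_match(prediction, reference):
    """True when the two strings match after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


# Illustrative data: two of three predictions are correct, two gold entities are missed.
predicted = ["Paris", "Berlin", "Rome"]
gold = ["Paris", "Rome", "Madrid", "Oslo"]
precision, recall, f1 = entity_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting precision and recall alongside F1 shows whether a low score comes from spurious extractions or from missed ones, which a single number hides.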
A rubric defines exactly what each score level means. Vague rubrics produce inconsistent ratings. Good rubrics are anchored with specific examples at each level and focus on observable criteria rather than subjective judgments.
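One way to keep a rubric anchored is to store it as data, so that a missing score level or a missing example can be caught before any rating happens. The sketch below is one possible layout, not a standard format: `ScoreLevel`, `validate_rubric`, `render_rubric`, and the faithfulness wording are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoreLevel:
    score: int
    description: str  # observable criterion for this level
    example: str      # anchoring example of an output at this level


# Hypothetical 1-5 rubric for a "faithfulness" criterion.
FAITHFULNESS_RUBRIC = [
    ScoreLevel(1, "Most claims are absent from the provided context.",
               "Cites a refund policy that appears nowhere in the context."),
    ScoreLevel(2, "Several claims are absent from the provided context.",
               "Correct product name, but invented price and release date."),
    ScoreLevel(3, "One significant claim is absent from the provided context.",
               "Accurate summary plus one unsupported statistic."),
    ScoreLevel(4, "Only minor unsupported details.",
               "Accurate answer that adds an unsupported adjective such as 'popular'."),
    ScoreLevel(5, "Every claim can be traced to the provided context.",
               "Each sentence restates or quotes a passage from the context."),
]


def validate_rubric(levels, low=1, high=5):
    """Return a list of problems; an empty list means the rubric is fully anchored."""
    problems = []
    present = {level.score for level in levels}
    for score in range(low, high + 1):
        if score not in present:
            problems.append(f"missing score level {score}")
    for level in levels:
        if not level.example.strip():
            problems.append(f"score {level.score} has no anchoring example")
    return problems


def render_rubric(criterion, levels):
    """Format the rubric as plain text for a human rater or a judge prompt."""
    lines = [f"Criterion: {criterion}"]
    for level in sorted(levels, key=lambda item: item.score):
        lines.append(f"{level.score}: {level.description}")
        lines.append(f"   Example: {level.example}")
    return "\n".join(lines)


print(validate_rubric(FAITHFULNESS_RUBRIC))
print(render_rubric("faithfulness", FAITHFULNESS_RUBRIC))
```

Because the same rendered text can go to human raters and to an LLM judge, both are scored against identical anchors.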
Prompt Templates
Custom Rubric Builder
Generates anchored evaluation rubrics with specific examples at each score level.
I need a scoring rubric for evaluating [TASK TYPE] outputs. The most important quality criteria are: [LIST 3-4 CRITERIA]. Create a 1-5 rubric for each criterion with:
- A clear description of what each score level means
- A concrete example of output that would receive that score
- Common mistakes that would lower the score
Format as a table.
Metric Selection Advisor
Gets expert advice on which evaluation metrics to use for your specific application.
I'm building an eval suite for [APPLICATION]. The task involves [TASK DESCRIPTION]. Users care most about [QUALITY PRIORITIES]. Recommend: (1) Primary metric with justification, (2) Secondary metrics to track, (3) How to measure each (automated vs. LLM-judge vs. human), (4) Red flags that indicate the wrong metric.
Key Takeaways
- ✓ Choose metrics that match your task type; the wrong metric leads to optimizing the wrong thing
- ✓ Anchor rubrics with specific examples at each score level for consistent, reproducible evaluation
- ✓ Test LLM-as-judge for position bias by swapping response order and averaging results
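The position-bias check in the last takeaway can be sketched as follows. This assumes a judge exposed as a callable that takes the prompt and two responses and returns a number between 0 and 1 for how strongly it prefers the first-listed response. That interface, the name `debiased_preference`, and the `toy_judge` stand-in are all assumptions for illustration; a real judge would call a model.

```python
def debiased_preference(judge, prompt, response_a, response_b):
    """Score A against B in both orders and average the two results.

    Returns (preference_for_a, position_gap). A large gap means the judge's
    verdict depends on which response is listed first.
    """
    a_first = judge(prompt, response_a, response_b)   # A shown first
    b_first = judge(prompt, response_b, response_a)   # order swapped
    a_when_second = 1.0 - b_first                     # convert to preference for A
    preference_for_a = (a_first + a_when_second) / 2
    position_gap = abs(a_first - a_when_second)
    return preference_for_a, position_gap


def toy_judge(prompt, first, second):
    """Stand-in for a real LLM judge (hypothetical behavior).

    Prefers the longer response, and adds a fixed bonus to whichever
    response is listed first to simulate position bias.
    """
    base = 0.7 if len(first) > len(second) else 0.3
    return base + 0.2


preference, gap = debiased_preference(
    toy_judge,
    "What is the capital of France?",
    "Paris is the capital of France.",
    "Paris.",
)
print(f"preference for A: {preference:.2f}, position gap: {gap:.2f}")
```

Averaging cancels a constant first-position bonus, while the gap is worth tracking separately: if it stays large across many pairs, the judge prompt itself probably needs revision.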