Scoring Rubrics & Metrics

Design rubrics and select metrics that reliably measure what matters for your use case.

The metric you choose determines what you optimize for. Common metrics include accuracy (for classification and other tasks with a single correct answer), BLEU/ROUGE (n-gram overlap with a reference text), F1 score (for extraction), and custom rubric scores (for open-ended tasks). Choosing the wrong metric means optimizing for the wrong thing.

  • Classification: Accuracy, precision, recall, F1 score
  • Extraction: Exact match, F1 over extracted entities, character-level overlap
  • Summarization: ROUGE scores, factual consistency (via LLM judge), compression ratio
  • Generation: LLM-as-judge scores, human preference ratings, task-specific rubrics
  • RAG: Answer correctness, faithfulness (only uses provided context), relevance of retrieved chunks
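
As a rough illustration of the classification and extraction metrics listed above, the following sketch computes set-based precision, recall, and F1 over extracted entities, plus a normalized exact match. The function names and the sample entity lists are hypothetical, and the normalization (lowercasing and collapsing whitespace) is an assumption; real eval suites often apply stricter or task-specific normalization.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 over extracted entities."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    # Guard against division by zero when either side is empty.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(predicted, gold):
    """Return 1.0 if the strings match after simple normalization, else 0.0."""
    def normalize(text):
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


# Hypothetical example: 2 of 3 predictions are correct, 2 of 4 gold entities found.
p, r, f1 = precision_recall_f1(
    ["Paris", "Berlin", "Rome"],
    ["Paris", "Berlin", "Madrid", "Oslo"],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(exact_match("  New   York ", "new york"))
```

Note how precision and recall diverge here: a system can look strong on one and weak on the other, which is why F1 is usually reported alongside both rather than instead of them.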

A rubric defines exactly what each score level means. Vague rubrics produce inconsistent ratings. Good rubrics are anchored with specific examples at each level and focus on observable criteria rather than subjective judgments.

Vague rubric:
5 = Excellent, 3 = Average, 1 = Poor

Anchored rubric (for a customer service response):
5 = Directly answers the question, provides specific actionable steps, empathetic tone
4 = Answers the question with minor omissions, mostly actionable
3 = Partially answers but misses key details or includes irrelevant information
2 = Tangentially related but doesn't solve the user's problem
1 = Wrong answer, ignores the question, or inappropriate tone

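
One way to put an anchored rubric to work is to store it as data and render it into the judge prompt, so every rating is made against the same level descriptions. The sketch below uses the customer service rubric above; `build_judge_prompt` and `parse_score` are hypothetical helper names, and the prompt wording is an assumption rather than a prescribed format.

```python
# The anchored customer service rubric, keyed by score level.
RUBRIC = {
    5: "Directly answers the question, provides specific actionable steps, empathetic tone",
    4: "Answers the question with minor omissions, mostly actionable",
    3: "Partially answers but misses key details or includes irrelevant information",
    2: "Tangentially related but doesn't solve the user's problem",
    1: "Wrong answer, ignores the question, or inappropriate tone",
}


def build_judge_prompt(question, response, rubric=RUBRIC):
    """Render the rubric levels (highest first) into a scoring prompt."""
    levels = "\n".join(
        f"{score} = {description}"
        for score, description in sorted(rubric.items(), reverse=True)
    )
    return (
        "Score the response using this rubric:\n"
        f"{levels}\n\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Reply with a single integer from 1 to 5."
    )


def parse_score(raw, rubric=RUBRIC):
    """Convert the judge's reply to an int and reject values outside the rubric."""
    score = int(raw.strip())
    if score not in rubric:
        raise ValueError(f"score {score} is not a rubric level")
    return score


prompt = build_judge_prompt(
    "How do I reset my password?",
    "Click 'Forgot password' on the login page and follow the emailed link.",
)
print(prompt)
```

Validating the parsed score against the rubric keys catches judge replies that drift outside the defined levels, which would otherwise silently skew averages.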
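When using LLM-as-judge, always test for position bias: swap the order of the responses being compared and check whether the verdict changes. Many models favor a response because of where it appears (often whichever comes first) rather than because of its quality.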
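
The swap test can be sketched as follows. The `judge` argument is a hypothetical callable standing in for whatever LLM call you use; it is assumed to take a prompt plus two responses and return the string "first" or "second". The two toy judges at the bottom exist only to show what a biased and an unbiased result look like.

```python
def position_bias_check(judge, pairs):
    """Run each comparison in both orders and summarize how stable the verdicts are.

    judge(prompt, first, second) must return "first" or "second".
    pairs is a list of (prompt, response_a, response_b) tuples.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    consistent = 0
    first_slot_wins = 0
    for prompt, response_a, response_b in pairs:
        original = judge(prompt, response_a, response_b)  # A shown first
        swapped = judge(prompt, response_b, response_a)   # B shown first
        # Map each positional verdict back to the underlying response.
        winner_original = "a" if original == "first" else "b"
        winner_swapped = "b" if swapped == "first" else "a"
        if winner_original == winner_swapped:
            consistent += 1
        first_slot_wins += (original == "first") + (swapped == "first")
    total = len(pairs)
    return {
        "consistency": consistent / total,
        "first_position_rate": first_slot_wins / (2 * total),
    }


# Toy judges for illustration only (not real models).
def always_first(prompt, first, second):
    return "first"


def longer_wins(prompt, first, second):
    return "first" if len(first) > len(second) else "second"


pairs = [("Explain DNS.", "short answer", "a much longer and more detailed answer")]
print(position_bias_check(always_first, pairs))
print(position_bias_check(longer_wins, pairs))
```

A judge with no position bias keeps the same winner in both orders, so its first-position rate sits near 0.5; a rate near 1.0 or 0.0 together with low consistency suggests the verdicts depend on ordering.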

Prompt Templates

Custom Rubric Builder

Generates anchored evaluation rubrics with specific examples at each score level.

I need a scoring rubric for evaluating [TASK TYPE] outputs. The most important quality criteria are: [LIST 3-4 CRITERIA].

Create a 1-5 rubric for each criterion with:
- A clear description of what each score level means
- A concrete example of output that would receive that score
- Common mistakes that would lower the score

Format as a table.

Metric Selection Advisor

Gets expert advice on which evaluation metrics to use for your specific application.

I'm building an eval suite for [APPLICATION]. The task involves [TASK DESCRIPTION]. Users care most about [QUALITY PRIORITIES].

Recommend: (1) Primary metric with justification, (2) Secondary metrics to track, (3) How to measure each (automated vs. LLM-judge vs. human), (4) Red flags that indicate the wrong metric.

Test Your Knowledge

Why are anchored rubrics better than vague rubrics?

Key Takeaways

  • Choose metrics that match your task type; the wrong metric leads to optimizing the wrong thing
  • Anchor rubrics with specific examples at each score level for consistent, reproducible evaluation
  • Test LLM-as-judge for position bias by swapping response order and averaging results