What is position bias in LLM-as-judge evaluation?

Position bias is the tendency of an LLM judge to favor a response because of where it appears in the prompt, often the first position, regardless of quality. Mitigate it by running each comparison in both orders and averaging the results, or by scoring each response independently.

Module 5, Lesson 2

Scoring Rubrics & Metrics

Design rubrics and select metrics that reliably measure what matters for your use case.

7 min read

2 quiz questions, 2 templates

The metric you choose determines what you optimize for. Common metrics include accuracy (for factual tasks), BLEU/ROUGE (for text similarity), F1 score (for extraction), and custom rubric scores (for open-ended tasks). The wrong metric leads to optimizing the wrong thing.

Classification: Accuracy, precision, recall, F1 score
Extraction: Exact match, F1 over extracted entities, character-level overlap
Summarization: ROUGE scores, factual consistency (via LLM judge), compression ratio
Generation: LLM-as-judge scores, human preference ratings, task-specific rubrics
RAG: Answer correctness, faithfulness (only uses provided context), relevance of retrieved chunks
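As a rough illustration of the classification and extraction metrics listed above, the following sketch computes accuracy, precision, recall, and F1 from plain Python lists, plus F1 over sets of extracted entities. The function names and the exact-string matching rule for entities are assumptions made for this example, not part of any particular library.

```python
def accuracy(gold, predicted):
    """Fraction of predictions that exactly match the gold label."""
    if not gold:
        return 0.0
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


def precision_recall_f1(gold, predicted, positive_label):
    """Precision, recall, and F1 with one label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def entity_f1(gold_entities, predicted_entities):
    """F1 over extracted entities, counting only exact string matches (an assumption)."""
    gold, pred = set(gold_entities), set(predicted_entities)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

A classifier that labels almost everything as the positive class can score perfect recall while precision drops, which is one reason to track F1 alongside accuracy rather than relying on a single number.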

A rubric defines exactly what each score level means. Vague rubrics produce inconsistent ratings. Good rubrics are anchored with specific examples at each level and focus on observable criteria rather than subjective judgments.

Vague rubric:
5 = Excellent, 3 = Average, 1 = Poor

Anchored rubric (for a customer service response):
5 = Directly answers the question, provides specific actionable steps, empathetic tone
4 = Answers the question with minor omissions, mostly actionable
3 = Partially answers but misses key details or includes irrelevant information
2 = Tangentially related but doesn't solve the user's problem
1 = Wrong answer, ignores the question, or inappropriate tone
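One way to put an anchored rubric to work is to store it as data, render it into the judge prompt, and validate the score that comes back. The sketch below does this for the customer service rubric above. The prompt wording, the "Score: <number>" reply format, and the helper names build_judge_prompt and parse_score are all hypothetical choices for this example; the call to the judge model itself is left out.

```python
import re
from typing import Dict, Optional

# The anchored rubric from the lesson, keyed by score level.
ANCHORED_RUBRIC: Dict[int, str] = {
    5: "Directly answers the question, provides specific actionable steps, empathetic tone",
    4: "Answers the question with minor omissions, mostly actionable",
    3: "Partially answers but misses key details or includes irrelevant information",
    2: "Tangentially related but doesn't solve the user's problem",
    1: "Wrong answer, ignores the question, or inappropriate tone",
}


def build_judge_prompt(question: str, response: str, rubric: Dict[int, str]) -> str:
    """Render the rubric and the item to score into one judge prompt (hypothetical wording)."""
    levels = "\n".join(f"{score} = {rubric[score]}" for score in sorted(rubric, reverse=True))
    return (
        "Score the customer service response using this rubric.\n\n"
        f"{levels}\n\n"
        f"Customer question:\n{question}\n\n"
        f"Response to score:\n{response}\n\n"
        "Reply with 'Score: <number>' followed by a one-sentence justification."
    )


def parse_score(judge_output: str, rubric: Dict[int, str]) -> Optional[int]:
    """Extract the score; return None if it is missing or not a level in the rubric."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if score in rubric else None
```

Rejecting out-of-range or missing scores, rather than guessing, keeps malformed judge replies from silently skewing the average.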

When using LLM-as-judge, always test for position bias: swap the order of responses being compared. Many models prefer whichever response appears first.
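The swap test can be sketched as follows. The judge here is any callable that takes a question and two responses and returns "first" or "second"; that interface, and the two toy judges used to exercise it, are assumptions for illustration rather than a real model API. This version treats a verdict that flips with the order as a tie, which is one simple stand-in for averaging the two runs.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (question, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, question: str, response_a: str, response_b: str) -> str:
    """Run the comparison in both orders; return "A", "B", or "tie" if the verdict flips."""
    original = judge(question, response_a, response_b)  # A shown first
    swapped = judge(question, response_b, response_a)   # B shown first
    winner_original = "A" if original == "first" else "B"
    winner_swapped = "B" if swapped == "first" else "A"
    return winner_original if winner_original == winner_swapped else "tie"


def position_bias_rate(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (question, A, B) triples whose verdict changes when the order is swapped."""
    if not pairs:
        return 0.0
    flips = sum(1 for q, a, b in pairs if compare_with_swap(judge, q, a, b) == "tie")
    return flips / len(pairs)


def always_first_judge(question: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: always prefers the first slot."""
    return "first"


def longer_judge(question: str, first: str, second: str) -> str:
    """Toy judge that ignores position and prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"
```

A high flip rate on a sample of pairs suggests the judge's pairwise verdicts should not be trusted from a single ordering; scoring each response independently against a rubric is the alternative.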

Prompt Templates

Custom Rubric Builder

Generates anchored evaluation rubrics with specific examples at each score level.

I need a scoring rubric for evaluating [TASK TYPE] outputs. The most important quality criteria are: [LIST 3-4 CRITERIA].

Create a 1-5 rubric for each criterion with:
- A clear description of what each score level means
- A concrete example of output that would receive that score
- Common mistakes that would lower the score

Format as a table.

Metric Selection Advisor

Gets expert advice on which evaluation metrics to use for your specific application.

I'm building an eval suite for [APPLICATION]. The task involves [TASK DESCRIPTION]. Users care most about [QUALITY PRIORITIES].

Recommend: (1) Primary metric with justification, (2) Secondary metrics to track, (3) How to measure each (automated vs. LLM-judge vs. human), (4) Red flags that indicate the wrong metric.

Test Your Knowledge

Why are anchored rubrics better than vague rubrics?

Key Takeaways

✓ Choose metrics that match your task type; the wrong metric leads to optimizing the wrong thing
✓ Anchor rubrics with specific examples at each score level for consistent, reproducible evaluation
✓ Test LLM-as-judge for position bias by swapping response order and averaging results
