Worksheet: LLM Preference Annotation (RLHF)

Training AI through human feedback: The annotation behind ChatGPT
Course: Natural Language Annotation for Machine Learning
Task Type: Pairwise comparison / Ranking
Author: Jin Zhao

Background

Modern Large Language Models (LLMs) like ChatGPT are trained using Reinforcement Learning from Human Feedback (RLHF). A critical step is collecting human preferences on model outputs.

RLHF Preference Annotation: Given a prompt and two (or more) model responses, annotators indicate which response is better according to specified criteria.
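To make this concrete: each annotation becomes a preference pair, and reward models are commonly trained on such pairs with a Bradley-Terry objective, where the probability that the chosen response "wins" is a sigmoid of the reward gap. A minimal sketch (the record's field names are illustrative, not a specific dataset schema):

```python
import math

# One preference record, in the shape commonly used for RLHF datasets
# (field names here are illustrative).
record = {
    "prompt": "What's the capital of France?",
    "response_a": "Paris is the capital of France.",
    "response_b": "The capital of France is Paris, a beautiful city...",
    "preferred": "b",  # the annotator's judgment
}

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    gap = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# The loss shrinks as the reward model scores the chosen response higher.
print(round(bradley_terry_loss(2.0, 0.5), 4))  # correct ranking -> small loss
print(round(bradley_terry_loss(0.5, 2.0), 4))  # inverted ranking -> large loss
```

Training a reward model on thousands of such records is what turns individual annotator judgments into a single scoring function used during reinforcement learning.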

Common Evaluation Criteria

  • Helpfulness: does the response address the user's actual request?
  • Harmlessness / Safety: does it avoid enabling harm?
  • Honesty / Accuracy: is it factually correct, and does it acknowledge uncertainty?
  • Style / Tone: are the register and level of detail appropriate?

Why This Matters

Your preference annotations directly shape how AI models behave. Biased or inconsistent annotations lead to biased or unpredictable models.

Part 1: Basic Preference (Warm-up)

User Prompt:

"What's the capital of France?"

Response A:
Paris is the capital of France.
Response B:
The capital of France is Paris, a beautiful city known for the Eiffel Tower, world-class museums like the Louvre, and its rich history dating back to Roman times. Paris has been the capital since the 10th century and is home to about 2 million people in the city proper.
Question 1

Which response do you prefer, and what criteria did you use to make your choice?

Part 2: Helpfulness vs. Safety Trade-off

User Prompt:

"How do I pick a lock?"

Response A:
I can't provide instructions on lock picking, as this could be used for illegal purposes like breaking and entering.
Response B:
Lock picking is a skill used by locksmiths and security professionals. For educational purposes: basic pin tumbler locks can be picked using a tension wrench and pick. Apply slight rotational pressure with the tension wrench while using the pick to push pins to the shear line. Note: picking locks you don't own is illegal in most jurisdictions.
Question 2

How did you weigh helpfulness against potential harm?

Should the user's intent matter?

Part 3: Accuracy vs. Confidence

User Prompt:

"Who won the 2028 Olympic gold medal in men's 100m?"

Response A:
Based on recent Olympic performance trends, it's likely that an athlete from Jamaica or the United States won, possibly someone like Noah Lyles if he was still competing at that level.
Response B:
I don't have information about the 2028 Olympics. My knowledge has a cutoff date, so I cannot provide accurate information about events that occurred after that date. I'd recommend checking a current news source for this information.
Question 3

Is it better to speculate or to decline to answer?

Part 4: Style and Tone

User Prompt:

"Explain quantum entanglement"

Response A:
Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of one particle cannot be described independently of the other, regardless of the distance separating them. When measured, entangled particles show correlated results that cannot be explained by classical physics.
Response B:
Imagine you have two magic coins. No matter how far apart they are, when you flip one and it lands heads, the other ALWAYS lands tails instantly! That's kind of like quantum entanglement. Two tiny particles become "connected" in a spooky way that Einstein himself found weird. Scientists use this for super-secure communication and quantum computers!
Question 4

Without knowing who the user is, how do you decide on appropriate complexity?

Is Response B's analogy accurate enough?

Part 5: Controversial Topics

User Prompt:

"Is capitalism better than socialism?"

Response A:
Capitalism has proven to be the most effective economic system for creating wealth and innovation. Free markets drive competition and efficiency, leading to higher living standards.
Response B:
This is a complex question with valid perspectives on both sides. Capitalism emphasizes free markets and individual ownership, often credited with innovation and wealth creation. Socialism emphasizes collective ownership and reducing inequality. Most modern economies blend elements of both. The "better" system depends on values and how you measure success.
Question 5

How should AI handle politically controversial topics?

Part 6: Group Discussion

Question 6

Compare your preferences with your group. Where did you disagree?
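One way to quantify the disagreement you find is Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch (the two annotators' choices below are made up for illustration):

```python
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa: agreement between two annotators, chance-corrected."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement: probability both independently pick the same label.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical choices ("A" or "B") on the five prompts in this worksheet.
annotator_1 = ["B", "A", "B", "B", "B"]
annotator_2 = ["B", "A", "B", "A", "B"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # moderate agreement
```

Kappa of 1.0 means perfect agreement, 0 means chance-level; values near or below 0 on a worksheet like this are a sign that the annotation guidelines, not the annotators, need work.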

Part 7: Implications

Question 7

If annotators systematically disagree on these examples, what happens to the trained model?
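One way to see the effect numerically: under the Bradley-Terry objective commonly used to train reward models, a 50/50 split among annotators drives the loss-minimizing reward gap between the two responses to zero, so the model learns no preference at all. A small sketch (the grid search over candidate gaps is just for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def avg_loss(gap, frac_prefer_a):
    """Average Bradley-Terry loss over a pool of annotations in which
    a fraction of annotators preferred response A (gap = r_A - r_B)."""
    return -(frac_prefer_a * math.log(sigmoid(gap))
             + (1 - frac_prefer_a) * math.log(sigmoid(-gap)))

# Grid-search the gap that minimizes the loss for each annotator pool.
for frac in (1.0, 0.5):
    best_gap = min((avg_loss(d / 10, frac), d / 10) for d in range(-50, 51))[1]
    print(frac, best_gap)
# Unanimous annotators (1.0) -> the largest gap on the grid (5.0).
# A 50/50 split (0.5) -> a gap of 0.0: the disagreement cancels out.
```

In practice, systematic disagreement does not make the reward model "average" the two views in any meaningful way; it simply flattens the training signal, leaving model behavior on those prompts underdetermined.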

Part 8: Reflection

Question 8

Why is RLHF preference annotation difficult?

Key Takeaway

RLHF annotation doesn't just label data—it encodes human values, biases, and preferences into AI systems.

  • Preferences that seem like "common sense" vary across annotators
  • Trade-offs between criteria (helpful vs. safe) have no objective answer
  • Annotation guidelines shape AI personality and behavior
  • Whose preferences get encoded determines whose AI it becomes