Modern Large Language Models (LLMs) like ChatGPT are fine-tuned using Reinforcement Learning from Human Feedback (RLHF). A critical step in this pipeline is collecting human preferences over model outputs.
RLHF Preference Annotation: Given a prompt and two (or more) model responses, annotators indicate which response is better according to specified criteria.
Your preference annotations directly shape how AI models behave. Biased or inconsistent annotations lead to biased or unpredictable models.
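To make the connection concrete, here is a minimal sketch of how a single preference annotation feeds into reward-model training. The record layout and field names are illustrative, not a standard schema; the loss is the standard Bradley-Terry pairwise objective commonly used for RLHF reward models.

```python
import math

# A single preference annotation: one prompt, two responses, an annotator's
# choice. Field names here are illustrative, not a standard schema.
annotation = {
    "prompt": "What's the capital of France?",
    "response_a": "Paris.",
    "response_b": "The capital of France is Paris, a city of about 2 million people.",
    "preferred": "b",  # the annotator's judgment
}

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) =
    sigmoid(reward_chosen - reward_rejected). Reward-model training
    minimizes this loss averaged over many annotated pairs.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# The loss shrinks as the reward model scores the preferred response higher:
print(bradley_terry_loss(2.0, 0.5))  # correct ranking -> small loss
print(bradley_terry_loss(0.5, 2.0))  # wrong ranking -> large loss
```

Every annotator judgment becomes one term in this loss, which is why biased or inconsistent labels propagate directly into the reward model and, from there, into the policy.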
"What's the capital of France?"
What criteria did you use to make your choice?
"How do I pick a lock?"
How did you weigh helpfulness against potential harm?
Should the user's intent matter?
"Who won the 2028 Olympic gold medal in men's 100m?"
Is it better to speculate or to decline to answer?
"Explain quantum entanglement"
Without knowing who the user is, how do you decide on appropriate complexity?
Is Response B's analogy accurate enough?
"Is capitalism better than socialism?"
How should AI handle politically controversial topics?
Compare your preferences with your group. Where did you disagree?
If annotators systematically disagree on these examples, what happens to the trained model?
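One way to quantify disagreement is per-item pairwise agreement, the observed-agreement ingredient of Fleiss' kappa. The labels and item names below are hypothetical, made up to mirror the five prompts above; "A" or "B" marks which response each of three annotators preferred.

```python
from collections import Counter

# Hypothetical labels from three annotators on the five prompts above.
labels = {
    "capital_of_france": ["A", "A", "A"],        # clear-cut: everyone agrees
    "lock_picking": ["A", "B", "B"],             # safety vs. helpfulness split
    "2028_olympics": ["B", "A", "B"],
    "quantum_entanglement": ["A", "A", "B"],
    "capitalism_vs_socialism": ["A", "B", "A"],  # value-laden: high disagreement
}

def pairwise_agreement(votes: list[str]) -> float:
    """Fraction of annotator pairs that gave the same label for one item."""
    n = len(votes)
    pairs = n * (n - 1) / 2
    agree = sum(c * (c - 1) / 2 for c in Counter(votes).values())
    return agree / pairs

for item, votes in labels.items():
    print(f"{item}: {pairwise_agreement(votes):.2f}")
```

Averaging these per-item scores gives the observed agreement used in Fleiss' kappa. Items where agreement is low contribute conflicting gradients during reward-model training, so the model learns a noisy compromise between the annotators' values rather than any one coherent preference.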
Why is RLHF preference annotation difficult?
RLHF annotation doesn't just label data—it encodes human values, biases, and preferences into AI systems.
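A toy calculation shows what systematic disagreement does to a trained reward model. Assume Bradley-Terry reward modeling and plain gradient descent, and suppose 2 of 3 annotators prefer response A over response B for the same prompt; the numbers and setup are illustrative only.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# The only parameter in this toy model: reward(A) - reward(B).
gap = 0.0
lr = 0.1
for _ in range(5000):
    # Gradient of the negative log-likelihood averaged over the three labels:
    # two "A beats B" terms and one "B beats A" term.
    grad = (2 * -(1 - sigmoid(gap)) + sigmoid(gap)) / 3
    gap -= lr * grad

# The gap converges to log(2) ~= 0.693: the model cannot satisfy everyone,
# so it settles on sigmoid(gap) = 2/3, a "2-to-1 majority" preference that
# averages the annotators' conflicting values.
print(round(gap, 3))
```

This is the precise sense in which annotation encodes values: the trained model's preference strength is a statistical blend of whatever the annotator pool disagreed about.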