Modern Large Language Models (LLMs) like ChatGPT are fine-tuned using Reinforcement Learning from Human Feedback (RLHF). A critical step in this pipeline is collecting human preferences over model outputs.
RLHF Preference Annotation: Given a prompt and two (or more) model responses, annotators indicate which response is better according to specified criteria.
Your preference annotations directly shape how AI models behave. Biased or inconsistent annotations lead to biased or unpredictable models.
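To make the connection concrete, here is a minimal sketch of how a single preference annotation feeds into reward-model training. The record layout and field names are illustrative, not a standard schema; the loss is the standard Bradley-Terry pairwise objective commonly used for RLHF reward models.

```python
import math

# A single preference annotation: one prompt, two responses, an annotator's
# choice. Field names here are illustrative, not a standard schema.
annotation = {
    "prompt": "What's the capital of France?",
    "response_a": "Paris.",
    "response_b": "The capital of France is Paris, a city of about 2 million people.",
    "preferred": "b",  # the annotator's judgment
}

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) =
    sigmoid(reward_chosen - reward_rejected). Reward-model training
    minimizes this loss averaged over many annotated pairs.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# The loss shrinks as the reward model scores the preferred response higher:
print(bradley_terry_loss(2.0, 0.5))  # correct ranking -> small loss
print(bradley_terry_loss(0.5, 2.0))  # wrong ranking -> large loss
```

Every annotator judgment becomes one term in this loss, which is why biased or inconsistent labels propagate directly into the reward model and, from there, into the policy.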
"What's the capital of France?"
What criteria did you use to make your choice?
"How do I pick a lock?"
How did you weigh helpfulness against potential harm?
Should the user's intent matter?
"Who won the 2028 Olympic gold medal in men's 100m?"
Is it better to speculate or to decline to answer?
"Explain quantum entanglement"
Without knowing who the user is, how do you decide on appropriate complexity?
Is Response B's analogy accurate enough?
"Is capitalism better than socialism?"
How should AI handle politically controversial topics?
Compare your preferences with your group. Where did you disagree?
If annotators systematically disagree on these examples, what happens to the trained model?
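One way to quantify disagreement is per-item pairwise agreement, the observed-agreement ingredient of Fleiss' kappa. The labels and item names below are hypothetical, made up to mirror the five prompts above; "A" or "B" marks which response each of three annotators preferred.

```python
from collections import Counter

# Hypothetical labels from three annotators on the five prompts above.
labels = {
    "capital_of_france": ["A", "A", "A"],        # clear-cut: everyone agrees
    "lock_picking": ["A", "B", "B"],             # safety vs. helpfulness split
    "2028_olympics": ["B", "A", "B"],
    "quantum_entanglement": ["A", "A", "B"],
    "capitalism_vs_socialism": ["A", "B", "A"],  # value-laden: high disagreement
}

def pairwise_agreement(votes: list[str]) -> float:
    """Fraction of annotator pairs that gave the same label for one item."""
    n = len(votes)
    pairs = n * (n - 1) / 2
    agree = sum(c * (c - 1) / 2 for c in Counter(votes).values())
    return agree / pairs

for item, votes in labels.items():
    print(f"{item}: {pairwise_agreement(votes):.2f}")
```

Averaging these per-item scores gives the observed agreement used in Fleiss' kappa. Items where agreement is low contribute conflicting gradients during reward-model training, so the model learns a noisy compromise between the annotators' values rather than any one coherent preference.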
Why is RLHF preference annotation difficult?
RLHF annotation doesn't just label data—it encodes human values, biases, and preferences into AI systems.
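A toy calculation shows what systematic disagreement does to a trained reward model. Assume Bradley-Terry reward modeling and plain gradient descent, and suppose 2 of 3 annotators prefer response A over response B for the same prompt; the numbers and setup are illustrative only.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# The only parameter in this toy model: reward(A) - reward(B).
gap = 0.0
lr = 0.1
for _ in range(5000):
    # Gradient of the negative log-likelihood averaged over the three labels:
    # two "A beats B" terms and one "B beats A" term.
    grad = (2 * -(1 - sigmoid(gap)) + sigmoid(gap)) / 3
    gap -= lr * grad

# The gap converges to log(2) ~= 0.693: the model cannot satisfy everyone,
# so it settles on sigmoid(gap) = 2/3, a "2-to-1 majority" preference that
# averages the annotators' conflicting values.
print(round(gap, 3))
```

This is the precise sense in which annotation encodes values: the trained model's preference strength is a statistical blend of whatever the annotator pool disagreed about.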