You are part of a team building an AI system that classifies online comments as Toxic or Not Toxic.
This system will be used for content moderation, so its mistakes can affect real people.
Your goal is to evaluate the preliminary label definitions, apply them to sample comments, and examine where annotators disagree.
Read the following preliminary label definitions:
Toxic: Language that is insulting, demeaning, or hostile toward a person or group.
Not Toxic: Language that does not meet the above criteria.
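For the later steps it can help to keep the label set and its rubric text in one machine-readable place; a minimal sketch in Python (the enum and dictionary names are illustrative assumptions, not part of any required schema):

```python
from enum import Enum

class Label(Enum):
    TOXIC = "toxic"
    NOT_TOXIC = "not_toxic"

# Rubric text kept next to the labels so annotation tools and the
# written guidelines stay in sync.
RUBRIC = {
    Label.TOXIC: ("Language that is insulting, demeaning, or hostile "
                  "toward a person or group."),
    Label.NOT_TOXIC: "Language that does not meet the above criteria.",
}
```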
Are these definitions sufficient to label comments consistently?
For each comment below, assign a label, explain why you chose it, and note any uncertainty.
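These three pieces of information can be captured as one record per comment; a minimal sketch (the field names and the filled-in values are hypothetical, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    comment_id: str   # which comment was labeled
    label: str        # "toxic" or "not_toxic"
    rationale: str    # why the annotator chose this label
    uncertainty: str  # any doubt the annotator noted; empty string if none

# A hypothetical filled-in record, for illustration only.
example = Annotation(
    comment_id="ex2",
    label="toxic",
    rationale="Directly insults the author of the post.",
    uncertainty="Could be read as sarcasm between friends.",
)
```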
Discuss your answers with 2–3 classmates.
Which examples had disagreement in your group? (Check all that apply.)
What caused the disagreement? (Check all that apply.)
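Rather than eyeballing the group's answers, you can group everyone's labels by example and flag any example that was not labeled unanimously; a small sketch (the sample annotations are made up for illustration):

```python
from collections import defaultdict

# (example_id, annotator, label) triples collected from the group;
# these values are hypothetical.
annotations = [
    ("ex1", "ana", "toxic"), ("ex1", "ben", "toxic"), ("ex1", "cam", "toxic"),
    ("ex2", "ana", "toxic"), ("ex2", "ben", "not_toxic"), ("ex2", "cam", "toxic"),
]

labels_by_example = defaultdict(list)
for example_id, _annotator, label in annotations:
    labels_by_example[example_id].append(label)

# Any example with more than one distinct label had disagreement.
for example_id, labels in labels_by_example.items():
    if len(set(labels)) > 1:
        print(f"{example_id}: disagreement {labels}")
```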
You now see the full annotation results for Example 2:
Out of 10 annotators:
How should this example be handled in the dataset?
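Whatever the split turns out to be, there are a few standard ways to fold a disputed example into a dataset; a sketch of three common options (the 6-vs-4 tally below is hypothetical, since the real counts come from the annotation results above):

```python
# Hypothetical vote tally for Example 2.
votes = {"toxic": 6, "not_toxic": 4}
total = sum(votes.values())

# Option 1: majority vote -- keep a single hard label.
hard_label = max(votes, key=votes.get)

# Option 2: soft label -- keep the full vote distribution.
soft_label = {label: count / total for label, count in votes.items()}

# Option 3: filter -- drop examples whose agreement falls below a threshold.
agreement = max(votes.values()) / total
keep = agreement >= 0.8

print(hard_label, soft_label, keep)  # toxic {'toxic': 0.6, 'not_toxic': 0.4} False
```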
Which of the following guideline changes would most reduce disagreement?
Why is this classification task difficult even though it has only two labels?
If you trained a model on this data without resolving disagreement, what might happen?
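One concrete way to see the risk: a hard majority label tells the model an ambiguous example is 100% toxic, even though the annotators were far less sure. The sketch below compares the two training targets under the same hypothetical 6-vs-4 split as above:

```python
import math

votes_toxic, total = 6, 10          # hypothetical split
hard_target = 1.0                   # majority vote collapses to certainty
soft_target = votes_toxic / total   # 0.6 -- preserves annotator uncertainty

def cross_entropy(target: float, predicted: float) -> float:
    """Binary cross-entropy loss for a single example."""
    return -(target * math.log(predicted)
             + (1 - target) * math.log(1 - predicted))

# Under the hard target, loss keeps dropping as the prediction is pushed
# toward 1.0, rewarding overconfidence on an ambiguous example.
# Under the soft target, loss is minimized exactly at p = 0.6.
for p in (0.6, 0.9, 0.99):
    print(p, round(cross_entropy(hard_target, p), 3),
          round(cross_entropy(soft_target, p), 3))
```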
Classification difficulty is driven by human judgment, not label count.
Even simple label sets can produce substantial annotator disagreement, ambiguous edge cases, and noisy training labels.
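One way to quantify that last point: with only two labels, raw percent agreement is inflated, because two annotators guessing at random would already agree about half the time. A sketch of chance-corrected agreement for two annotators (Cohen's kappa; the sample labels are made up):

```python
def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(lbl) / n) * (b.count(lbl) / n) for lbl in labels)
    return (observed - expected) / (1 - expected)

ann1 = ["toxic", "toxic", "not_toxic", "toxic", "not_toxic", "not_toxic"]
ann2 = ["toxic", "not_toxic", "not_toxic", "toxic", "toxic", "not_toxic"]
# Raw agreement is 4/6 ~ 0.67, but kappa is only ~0.33 once chance
# agreement on a two-label task is accounted for.
print(cohens_kappa(ann1, ann2))
```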