You are part of a team building an AI system that classifies online comments as Toxic or Not Toxic.
This system will be used for content moderation, so its mistakes can affect real people.
Your goal is to evaluate the preliminary label definitions, apply them to sample comments, and examine where annotators disagree.
Read the following preliminary label definitions:
Toxic: Language that is insulting, demeaning, or hostile toward a person or group.
Not Toxic: Language that does not meet the above criteria.
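For the later steps it can help to keep the label set and its rubric text in one machine-readable place; a minimal sketch in Python (the enum and dictionary names are illustrative assumptions, not part of any required schema):

```python
from enum import Enum

class Label(Enum):
    TOXIC = "toxic"
    NOT_TOXIC = "not_toxic"

# Rubric text kept next to the labels so annotation tools and the
# written guidelines stay in sync.
RUBRIC = {
    Label.TOXIC: ("Language that is insulting, demeaning, or hostile "
                  "toward a person or group."),
    Label.NOT_TOXIC: "Language that does not meet the above criteria.",
}
```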
Are these definitions sufficient to label comments consistently?
For each comment below, assign a label, explain why you chose it, and note any uncertainty.
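These three pieces of information can be captured as one record per comment; a minimal sketch (the field names and the filled-in values are hypothetical, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    comment_id: str   # which comment was labeled
    label: str        # "toxic" or "not_toxic"
    rationale: str    # why the annotator chose this label
    uncertainty: str  # any doubt the annotator noted; empty string if none

# A hypothetical filled-in record, for illustration only.
example = Annotation(
    comment_id="ex2",
    label="toxic",
    rationale="Directly insults the author of the post.",
    uncertainty="Could be read as sarcasm between friends.",
)
```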
Discuss your answers with 2–3 classmates.
Which examples had disagreement in your group? (Check all that apply.)
What caused the disagreement? (Check all that apply.)
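Rather than eyeballing the group's answers, you can group everyone's labels by example and flag any example that was not labeled unanimously; a small sketch (the sample annotations are made up for illustration):

```python
from collections import defaultdict

# (example_id, annotator, label) triples collected from the group;
# these values are hypothetical.
annotations = [
    ("ex1", "ana", "toxic"), ("ex1", "ben", "toxic"), ("ex1", "cam", "toxic"),
    ("ex2", "ana", "toxic"), ("ex2", "ben", "not_toxic"), ("ex2", "cam", "toxic"),
]

labels_by_example = defaultdict(list)
for example_id, _annotator, label in annotations:
    labels_by_example[example_id].append(label)

# Any example with more than one distinct label had disagreement.
for example_id, labels in labels_by_example.items():
    if len(set(labels)) > 1:
        print(f"{example_id}: disagreement {labels}")
```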
You now see the full annotation results for Example 2:
Out of 10 annotators:
How should this example be handled in the dataset?
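Whatever the split turns out to be, there are a few standard ways to fold a disputed example into a dataset; a sketch of three common options (the 6-vs-4 tally below is hypothetical, since the real counts come from the annotation results above):

```python
# Hypothetical vote tally for Example 2.
votes = {"toxic": 6, "not_toxic": 4}
total = sum(votes.values())

# Option 1: majority vote -- keep a single hard label.
hard_label = max(votes, key=votes.get)

# Option 2: soft label -- keep the full vote distribution.
soft_label = {label: count / total for label, count in votes.items()}

# Option 3: filter -- drop examples whose agreement falls below a threshold.
agreement = max(votes.values()) / total
keep = agreement >= 0.8

print(hard_label, soft_label, keep)  # toxic {'toxic': 0.6, 'not_toxic': 0.4} False
```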
Which of the following guideline changes would most reduce disagreement?
Why is this classification task difficult even though it has only two labels?
If you trained a model on this data without resolving disagreement, what might happen?
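One concrete way to see the risk: a hard majority label tells the model an ambiguous example is 100% toxic, even though the annotators were far less sure. The sketch below compares the two training targets under the same hypothetical 6-vs-4 split as above:

```python
import math

votes_toxic, total = 6, 10          # hypothetical split
hard_target = 1.0                   # majority vote collapses to certainty
soft_target = votes_toxic / total   # 0.6 -- preserves annotator uncertainty

def cross_entropy(target: float, predicted: float) -> float:
    """Binary cross-entropy loss for a single example."""
    return -(target * math.log(predicted)
             + (1 - target) * math.log(1 - predicted))

# Under the hard target, loss keeps dropping as the prediction is pushed
# toward 1.0, rewarding overconfidence on an ambiguous example.
# Under the soft target, loss is minimized exactly at p = 0.6.
for p in (0.6, 0.9, 0.99):
    print(p, round(cross_entropy(hard_target, p), 3),
          round(cross_entropy(soft_target, p), 3))
```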
Classification difficulty is driven by human judgment, not label count.
Even simple label sets can produce substantial annotator disagreement, ambiguous edge cases, and noisy training labels.
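One way to quantify that last point: with only two labels, raw percent agreement is inflated, because two annotators guessing at random would already agree about half the time. A sketch of chance-corrected agreement for two annotators (Cohen's kappa; the sample labels are made up):

```python
def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(lbl) / n) * (b.count(lbl) / n) for lbl in labels)
    return (observed - expected) / (1 - expected)

ann1 = ["toxic", "toxic", "not_toxic", "toxic", "not_toxic", "not_toxic"]
ann2 = ["toxic", "not_toxic", "not_toxic", "toxic", "toxic", "not_toxic"]
# Raw agreement is 4/6 ~ 0.67, but kappa is only ~0.33 once chance
# agreement on a two-label task is accounted for.
print(cohens_kappa(ann1, ann2))
```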