Worksheet: Named Entity Recognition (NER)

Understanding boundary decisions and entity type ambiguity
Course: Natural Language Annotation for Machine Learning
Task Type: Sequence labeling (BIO tagging)
Author: Jin Zhao

Background

You are annotating data for a Named Entity Recognition (NER) system that identifies people, organizations, locations, and miscellaneous entities in text.

NER seems straightforward but involves many difficult decisions about where an entity begins and ends and which type it belongs to.

Entity Types

PER: Person names
ORG: Organizations
LOC: Locations
MISC: Miscellaneous (events, products, etc.)

BIO Tagging Scheme

B-TYPE: Beginning of entity
I-TYPE: Inside/continuation of entity
O: Outside any entity
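
To make the scheme concrete, here is a minimal Python sketch that reads a BIO-tagged token sequence back into entity spans. The function name `bio_to_spans` and the example sentence are illustrative assumptions, not part of the worksheet; the sentence was chosen so that it does not give away any of the exercises.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, type) pairs from parallel token and BIO tag lists."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag with no matching B-) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Illustrative sentence (an assumption, not taken from the worksheet).
tokens = ["Marie", "Curie", "worked", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('Marie Curie', 'PER'), ('Paris', 'LOC')]
```

Note that the B- tag is what lets two adjacent entities of the same type stay separate; without it, the boundary between them would be lost.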

Part 1: Basic Entity Recognition

Sentence 1
"Barack Obama visited the White House in Washington."
Barack
Obama
visited
the
White
House
in
Washington

Question 1

Is "White House" a location (LOC) or an organization (ORG)?

Part 2: Ambiguous Entity Types

Sentence 2
"Apple announced new products at their Cupertino headquarters."
Apple
announced
new
products
at
their
Cupertino
headquarters

Question 2

Should "Cupertino headquarters" be tagged as one entity or two?

Part 3: Metonymy and Reference

Sentence 3
"Washington condemned the attacks on civilians."
Washington
condemned
the
attacks
on
civilians

Question 3

"Washington" here refers to the U.S. government, not the city. How should it be tagged?

Part 4: Complex Named Entities

Sentence 4
"The New York Times reported on University of California, Berkeley research."
The
New
York
Times
reported
on
University
of
California
,
Berkeley
research

Question 4

Should "The" be included in "The New York Times"?

Part 5: Nested Entities

Sentence 5
"The Bank of America Tower is the tallest building in Charlotte."
Question 5

"Bank of America Tower" contains "Bank of America" (ORG) and refers to a building (LOC). How should this be handled?
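
One way to see why this is hard: flat BIO tagging gives each token exactly one tag, so it can keep only one of the two readings. The Python sketch below stores both spans from Question 5 as token-index ranges and projects them onto BIO tags under two possible policies. The function `flatten_to_bio` and the `keep` parameter are hypothetical names introduced for illustration; neither policy is presented as the correct answer.

```python
tokens = ["The", "Bank", "of", "America", "Tower", "is", "the",
          "tallest", "building", "in", "Charlotte"]

# (start, end_exclusive, type) over token indices.
# Only the two spans named in Question 5 are listed; other entities
# in the sentence are left out to keep the sketch short.
spans = [
    (1, 5, "LOC"),  # "Bank of America Tower"
    (1, 4, "ORG"),  # "Bank of America"
]


def flatten_to_bio(n_tokens, spans, keep="outermost"):
    """Project possibly nested spans onto one BIO tag per token.

    keep="outermost" places longer spans first; keep="innermost" places
    shorter spans first. A span that overlaps an already placed span is dropped.
    """
    tags = ["O"] * n_tokens
    ordered = sorted(spans, key=lambda s: s[1] - s[0],
                     reverse=(keep == "outermost"))
    for start, end, label in ordered:
        if all(tag == "O" for tag in tags[start:end]):
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
    return tags


outer = flatten_to_bio(len(tokens), spans, keep="outermost")
inner = flatten_to_bio(len(tokens), spans, keep="innermost")
for token, o, i in zip(tokens, outer, inner):
    print(f"{token:10} {o:6} {i}")
```

Under the outermost policy the ORG reading disappears; under the innermost policy "Tower" falls outside any entity. A span-based format like the `spans` list can keep both, at the cost of a more complex annotation task.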

Part 6: Group Comparison

Compare your annotations with your group members.

Question 6

Where did you disagree most?
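
If your group wants to quantify the comparison, the following Python sketch lists the tokens on which two annotators differ and computes token-level Cohen's kappa. The two tag sequences are hypothetical annotations of Sentence 3, invented for illustration, and the function names are assumptions rather than part of any library.

```python
from collections import Counter


def disagreements(tokens, tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every token the annotators tagged differently."""
    return [(t, a, b) for t, a, b in zip(tokens, tags_a, tags_b) if a != b]


def cohen_kappa(tags_a, tags_b):
    """Token-level Cohen's kappa for two equally long tag sequences."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must have the same length")
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement: probability both annotators pick the same label
    # if each draws independently from their own label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


tokens = ["Washington", "condemned", "the", "attacks", "on", "civilians"]
annotator_a = ["B-LOC", "O", "O", "O", "O", "O"]  # hypothetical annotation
annotator_b = ["B-ORG", "O", "O", "O", "O", "O"]  # hypothetical annotation

print(disagreements(tokens, annotator_a, annotator_b))
print(round(cohen_kappa(annotator_a, annotator_b), 3))
```

Raw agreement here is 5 out of 6 tokens, yet the annotators disagree on the only entity in the sentence. Because most tokens are tagged O, token-level scores tend to look better than entity-level agreement, which is worth keeping in mind when comparing results.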

Part 7: Guideline Design

Question 7

Which guideline change would most improve annotator agreement?

Part 8: Reflection

Question 8

Why is NER harder than it appears?

Key Takeaway

Named Entity Recognition requires making decisions that combine linguistic form with world knowledge.

  • Entity types are categories we impose on a continuous reality
  • Boundaries are conventions, not facts
  • Context determines meaning, but annotations are often context-free