Worksheet: Named Entity Recognition

Background

You are annotating data for a Named Entity Recognition (NER) system that identifies people, organizations, locations, and miscellaneous entities in text.

NER seems straightforward but involves many difficult decisions about:

Entity boundaries (where does an entity start and end?)
Entity types (is "Apple" a company or a fruit?)
Nested entities (can entities contain other entities?)

Entity Types

PER   Person names
ORG   Organizations
LOC   Locations
MISC  Miscellaneous (events, products, etc.)

BIO Tagging Scheme

B-TYPE  Beginning of an entity
I-TYPE  Inside (continuation of) an entity
O       Outside any entity
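
To make the scheme concrete, the sketch below converts entity spans into BIO tags. It is a minimal illustration rather than part of the annotation guidelines: the function name spans_to_bio, the (start, end, type) span format with an exclusive end index, and the example sentence are all assumptions chosen for this illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) token spans into one BIO tag per token.

    Assumptions for this sketch: `end` is exclusive and spans do not overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence (not one of the worksheet sentences).
tokens = ["Marie", "Curie", "worked", "in", "Paris", "."]
spans = [(0, 2, "PER"), (4, 5, "LOC")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

Every token receives exactly one tag, which is why a flat BIO sequence cannot mark a token as belonging to two entities at once.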

Part 1: Basic Entity Recognition

Sentence 1

"Barack Obama visited the White House in Washington."

Token        Tag
Barack       ________
Obama        ________
visited      ________
the          ________
White        ________
House        ________
in           ________
Washington   ________

Question 1

Is "White House" a location (LOC) or an organization (ORG)?

LOC (it's a building)
ORG (it represents the administration)
Depends on context

Explain your reasoning:

Part 2: Ambiguous Entity Types

Sentence 2

"Apple announced new products at their Cupertino headquarters."

Token          Tag
Apple          ________
announced      ________
new            ________
products       ________
at             ________
their          ________
Cupertino      ________
headquarters   ________

Question 2

Should "Cupertino headquarters" be tagged as a single entity, should only "Cupertino" be tagged, or is neither an entity?

One entity: "Cupertino headquarters" (LOC)
Just "Cupertino" (LOC)
Neither is an entity

What general boundary principle does your choice imply?

Part 3: Metonymy and Reference

Sentence 3

"Washington condemned the attacks on civilians."

Token        Tag
Washington   ________
condemned    ________
the          ________
attacks      ________
on           ________
civilians    ________

Question 3

"Washington" here refers to the U.S. government, not the city. How should it be tagged?

LOC (surface form is a place)
ORG (semantic reference is the government)
GPE (geopolitical entity, a special type outside the four listed above)

Should annotation follow surface form or semantic meaning?

Part 4: Complex Named Entities

Sentence 4

"The New York Times reported on University of California, Berkeley research."

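Token        Tag
The          ________
New          ________
York         ________
Times        ________
reported     ________
on           ________
University   ________
of           ________
California   ________
,            ________
Berkeley     ________
research     ________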

Question 4

Should "The" be included in "The New York Times"?

Yes, it's part of the official name
No, articles are never part of entities
Depends on specific guidelines

How do you handle punctuation in "University of California, Berkeley"?
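
One way to see what is at stake in the punctuation decision is to decode two candidate taggings back into entities. The sketch below is only an illustration under stated assumptions: the function name bio_to_spans is hypothetical, and the ORG type on both candidates is a placeholder, not a recommended answer.

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity text, type) pairs, in sentence order."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")   # "O" yields prefix "O", empty type
        if prefix == "I" and current and etype == current_type:
            current.append(token)               # continue the open entity
            continue
        if current:                             # close the open entity, if any
            spans.append((" ".join(current), current_type))
            current, current_type = [], None
        if prefix in ("B", "I"):                # a stray I- tag starts a new entity
            current, current_type = [token], etype
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["University", "of", "California", ",", "Berkeley"]

# Candidate A: the comma is tagged as inside the entity.
comma_inside = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG"]
# Candidate B: the comma is tagged O, which splits the name in two.
comma_outside = ["B-ORG", "I-ORG", "I-ORG", "O", "B-ORG"]

print(bio_to_spans(tokens, comma_inside))
print(bio_to_spans(tokens, comma_outside))
```

Under candidate A the decoder returns one entity; under candidate B it returns two. A single tag on the comma therefore changes how many entities the sentence contains.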

Part 5: Nested Entities

Sentence 5

"The Bank of America Tower is the tallest building in Charlotte."

Question 5

"Bank of America Tower" contains "Bank of America" (ORG) and refers to a building (LOC). How should this be handled?

Tag only the outer entity (LOC)
Tag only the inner entity (ORG)
Tag both (nested annotation)
Use the longest-match rule

What are the tradeoffs of each approach?
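
The options above can be sketched in code by storing entities as token spans and then filtering them. This is a minimal illustration, not a prescribed method: the (start, end, type) span format with an exclusive end index and the helper names keep_outermost and keep_innermost are assumptions made for this example. The two span types come from Question 5.

```python
def keep_outermost(spans):
    """Drop every span that lies fully inside another span (longest match)."""
    return [
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    ]


def keep_innermost(spans):
    """Drop every span that fully contains another span."""
    return [
        s for s in spans
        if not any(o != s and s[0] <= o[0] and o[1] <= s[1] for o in spans)
    ]


tokens = ["The", "Bank", "of", "America", "Tower", "is", "the",
          "tallest", "building", "in", "Charlotte", "."]

# Nested annotation keeps both spans; this sketch assumes no two spans
# share identical boundaries.
nested = [(1, 5, "LOC"), (1, 4, "ORG")]

for label, spans in [("nested", nested),
                     ("outer only", keep_outermost(nested)),
                     ("inner only", keep_innermost(nested))]:
    readable = [(" ".join(tokens[s:e]), t) for s, e, t in spans]
    print(label, readable)
```

Only the nested representation preserves both readings. Either flat option discards one of them, and a single BIO tag sequence can express only a flat option.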

Part 6: Group Comparison

Compare your annotations with your group members.

Question 6

Where did you disagree most?

Entity type assignment
Entity boundaries
Whether something is an entity at all
Handling nested entities

Give one specific example of disagreement:
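
If your group wants to put a number on its disagreements, one common option is Cohen's kappa computed over token-level tags. The sketch below assumes two annotators who tagged the same tokens. The function name cohen_kappa and the two example taggings of Sentence 3 are hypothetical, chosen only to show a type disagreement.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators' tags over the same tokens."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("both annotators must tag the same non-empty token list")
    n = len(tags_a)
    # Observed agreement: share of tokens with identical tags.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected agreement by chance, from each annotator's tag frequencies.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:          # both used one identical tag everywhere
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical taggings of "Washington condemned the attacks on civilians"
annotator_a = ["B-LOC", "O", "O", "O", "O", "O"]
annotator_b = ["B-ORG", "O", "O", "O", "O", "O"]

raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"raw agreement: {raw:.2f}")
print(f"Cohen's kappa: {cohen_kappa(annotator_a, annotator_b):.2f}")
```

Here the annotators agree on five of six tokens, but kappa is much lower because most of that agreement is on O tags, which are easy to agree on by chance. Token-level scores can therefore hide disagreement on the entities themselves.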

Part 7: Guideline Design

Question 7

Which guideline change would most improve annotator agreement?

More examples for each entity type
Clear rules for metonymy
Explicit boundary conventions
Decision on nested entity handling
Finer-grained entity types

Explain your choice:

Part 8: Reflection

Question 8

Why is NER harder than it appears?

Names are ambiguous (Apple, Washington)
Boundaries are context-dependent
Type systems don't capture reality
Requires world knowledge
All of the above

Key Takeaway

Named Entity Recognition requires making decisions that combine linguistic form with world knowledge.

Entity types are categories we impose on a continuous reality
Boundaries are conventions, not facts
Context determines meaning, but annotations are often context-free