You are annotating data for a Named Entity Recognition (NER) system that identifies people, organizations, locations, and miscellaneous entities in text.
NER seems straightforward but involves many difficult decisions about entity type, entity boundaries, and nested or metonymic names.
PER     Person names
ORG     Organizations
LOC     Locations
MISC    Miscellaneous (events, products, etc.)

B-TYPE  Beginning of entity
I-TYPE  Inside/continuation of entity
O       Outside any entity

Is "White House" a location (LOC) or an organization (ORG)?
Should "Cupertino headquarters" be tagged as one entity or two?
"Washington" here refers to the U.S. government, not the city. How should it be tagged?
Should "The" be included in "The New York Times"?
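To make the boundary question concrete, here is a minimal Python sketch that turns a B-TYPE / I-TYPE / O tag sequence into entity spans and compares two possible annotations of "The New York Times". The example sentence, the function name bio_to_spans, and the lenient handling of a stray I- tag are assumptions made for illustration. They are not part of any official guideline.

```python
from typing import List, Tuple

# (entity type, start token index, end token index, end exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect (type, start, end) spans from a sequence of B-/I-/O tags."""
    spans: List[Span] = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the final span
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                spans.append((current, start, i))
                start, current = None, None
            if tag != "O":
                # A B- tag, or a stray I- tag treated leniently as a span start.
                start, current = i, tag[2:]
    return spans


# Hypothetical sentence, invented for illustration only.
tokens = ["She", "reads", "The", "New", "York", "Times", "."]

# Option A: the determiner is part of the entity.
tags_a = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
# Option B: the determiner is left outside the entity.
tags_b = ["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]

for name, tags in [("A", tags_a), ("B", tags_b)]:
    for etype, s, e in bio_to_spans(tags):
        print(name, etype, " ".join(tokens[s:e]))
```

Both options agree on the entity type and differ only in where the span starts. Under exact-match scoring that single token is enough to count as a disagreement, which is why guidelines usually settle the determiner question explicitly.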
"Bank of America Tower" contains "Bank of America" (ORG) and refers to a building (LOC). How should this be handled?
Compare your annotations with your group members.
Where did you disagree most?
Which guideline change would most improve annotator agreement?
Why is NER harder than it appears?
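One way to put a number on the disagreement questions above is Cohen's kappa over token-level tags, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration. The function name cohen_kappa and the two tag sequences are invented for the example, and it assumes both annotators tagged the same tokens in the same order.

```python
from collections import Counter
from typing import List


def cohen_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token list")
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical tags.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own tag distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of the same six tokens (illustration only).
annotator_1 = ["B-ORG", "I-ORG", "O", "B-LOC", "O", "O"]
annotator_2 = ["B-LOC", "I-LOC", "O", "B-LOC", "O", "O"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"token-level kappa: {kappa:.2f}")
```

Token-level kappa tends to look optimistic because most tokens are O and are easy to agree on. Comparing entity spans by exact match is a common complementary check. Recomputing the score after a guideline change shows whether that change actually improved agreement.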
Named Entity Recognition requires making decisions that combine linguistic form with world knowledge.