Worksheet: Summarization Quality Evaluation

Multi-dimensional assessment of generated summaries
Course: Natural Language Annotation for Machine Learning
Task Type: Quality evaluation / Multi-dimensional rating
Author: Jin Zhao

Background

Evaluating summarization quality requires assessing multiple dimensions simultaneously. Unlike simple classification, summary evaluation involves subjective judgments about what information is important and how well it's conveyed.

Key Evaluation Dimensions:

  • Faithfulness: Is the summary factually consistent with the source?
  • Relevance: Does it include important information?
  • Coherence: Is it well-organized and readable?
  • Conciseness: Is it appropriately brief without losing meaning?
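For anyone recording ratings in a script rather than on paper, the four dimensions above could be stored as one record per annotator per summary. This is a minimal sketch, not part of the worksheet itself: the class name SummaryRating, its field names, and the assumption that every dimension uses the same 1-5 scale as the faithfulness rating are all illustrative choices.

```python
from dataclasses import dataclass

# The four dimensions listed in the worksheet.
DIMENSIONS = ("faithfulness", "relevance", "coherence", "conciseness")


@dataclass(frozen=True)
class SummaryRating:
    """One annotator's ratings for one summary (hypothetical record layout)."""
    summary_id: str
    annotator: str
    faithfulness: int
    relevance: int
    coherence: int
    conciseness: int

    def __post_init__(self):
        # Assumption: every dimension is rated on a 1-5 scale.
        for dim in DIMENSIONS:
            score = getattr(self, dim)
            if not 1 <= score <= 5:
                raise ValueError(f"{dim} must be between 1 and 5, got {score}")

    def as_dict(self):
        """Return only the dimension scores, keyed by dimension name."""
        return {dim: getattr(self, dim) for dim in DIMENSIONS}


# Made-up scores for an unnamed example summary, for illustration only.
example = SummaryRating("example-1", "annotator_1", 2, 4, 5, 3)
print(example.as_dict())
```

Keeping one score per dimension, instead of a single overall number, preserves the cases where a summary does well on one dimension and poorly on another.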

Part 1: Faithfulness Evaluation

Source Article:

Scientists at MIT have developed a new battery technology that could charge electric vehicles in just 10 minutes. The research, published in Nature Energy, shows that the lithium-ion batteries use a modified electrode structure. Lead researcher Dr. Sarah Chen noted that while promising, the technology is still 3-5 years from commercial production. Current EV batteries typically require 30-60 minutes for a full charge.

Summary A:

MIT researchers have created a revolutionary battery that charges EVs in 10 minutes, making range anxiety a thing of the past. The breakthrough, led by Dr. Sarah Chen, will be available to consumers within 2 years.

Rate Faithfulness (1-5):
Question 1

Identify specific faithfulness errors in Summary A:

Part 2: Relevance vs. Faithfulness Trade-off

Summary B (Same source):

A study in Nature Energy describes lithium-ion batteries with modified electrode structures.

Rate Summary B (1-5):

Faithfulness:

Relevance:

Question 2

Summary B is faithful but omits key information. Which is worse: an unfaithful summary or an incomplete one?

Part 3: Coherence Evaluation

Source Article:

The city council approved a $50 million budget for road repairs. Mayor Johnson praised the decision, calling it "long overdue." Critics argue the money should go to schools instead. The repairs will focus on the downtown area first, then expand to suburbs over three years.

Summary C:

Road repairs will start downtown. $50 million was approved by the council. Mayor Johnson praised it. Critics want school funding. Suburbs will be repaired later.

Summary D:

The city council approved $50 million for road repairs that will begin downtown and expand to suburbs over three years. While Mayor Johnson praised the decision, critics argue the funding should go to schools instead.

Question 3

Both summaries contain nearly the same information (Summary C omits the three-year timeline). Rate their coherence (1-5):

Summary C coherence:

Summary D coherence:

Part 4: Overall Quality Ranking

Question 4

Rank the summaries from best to worst overall:

Part 5: Error Severity

Question 5

Rank these summary errors from most to least severe:

Part 6: Group Discussion

Question 6

Compare ratings with your group. Where did you disagree?
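One simple way to locate disagreements is to compute, for each dimension, the gap between the highest and lowest rating in the group. The sketch below assumes ratings are collected as a dictionary of annotator name to dimension scores; the function names, the threshold of 2 points, and the example scores are all hypothetical. The score range is a deliberately crude measure, not a formal agreement statistic.

```python
def disagreement_by_dimension(ratings):
    """Return the max-min spread of scores for each dimension.

    ratings: dict mapping annotator name -> dict of dimension -> score.
    """
    dims = sorted({dim for scores in ratings.values() for dim in scores})
    spread = {}
    for dim in dims:
        values = [scores[dim] for scores in ratings.values() if dim in scores]
        spread[dim] = max(values) - min(values)
    return spread


def flag_disagreements(spread, threshold=2):
    """List dimensions whose spread meets the (assumed) threshold."""
    return [dim for dim, gap in spread.items() if gap >= threshold]


# Made-up ratings from three group members, for illustration only.
group = {
    "annotator_1": {"faithfulness": 2, "relevance": 4},
    "annotator_2": {"faithfulness": 1, "relevance": 4},
    "annotator_3": {"faithfulness": 4, "relevance": 3},
}

spread = disagreement_by_dimension(group)
print(spread)                      # {'faithfulness': 3, 'relevance': 1}
print(flag_disagreements(spread))  # ['faithfulness']
```

Dimensions flagged this way are good starting points for the discussion: ask each annotator which part of the summary drove their score.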

Part 7: Reflection

Question 7

Why is summarization evaluation difficult?

Key Takeaway

Summary evaluation is inherently multi-dimensional, and annotators must balance competing criteria while maintaining consistent standards.

  • Faithfulness errors are not all equally severe
  • A summary can excel on one dimension while failing on another
  • Relevance depends on who the summary is for
  • Annotation guidelines must specify dimension priorities
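
To make the last point concrete, dimension priorities can be written down as explicit weights and combined into a single score. This is only a sketch under stated assumptions: the weights below (faithfulness weighted highest) and both sets of example ratings are invented for illustration, and a weighted mean is one possible aggregation, not a method prescribed by this worksheet.

```python
def overall_score(ratings, weights):
    """Weighted mean of per-dimension ratings.

    ratings: dict of dimension -> score.
    weights: dict of dimension -> non-negative priority weight.
    """
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(ratings[dim] * w for dim, w in weights.items()) / total_weight


# Hypothetical priorities: faithfulness matters most, conciseness least.
weights = {"faithfulness": 4, "relevance": 3, "coherence": 2, "conciseness": 1}

# Made-up ratings for two imaginary summaries.
fluent_but_unfaithful = {"faithfulness": 1, "relevance": 4, "coherence": 5, "conciseness": 5}
faithful_but_thin = {"faithfulness": 5, "relevance": 2, "coherence": 4, "conciseness": 5}

print(overall_score(fluent_but_unfaithful, weights))  # 3.1
print(overall_score(faithful_but_thin, weights))      # 3.9
```

Note that a weighted mean can still let strong coherence partly offset a serious faithfulness failure. Guidelines that treat faithfulness as non-negotiable might instead set a minimum faithfulness score before any other dimension is considered.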