Worksheet: Machine Translation Quality Evaluation

Assessing accuracy, fluency, and adequacy in MT output
Course: Natural Language Annotation for Machine Learning
Task Type: Ordinal rating + Error annotation
Author: Jin Zhao

Background

Evaluating machine translation quality is essential for improving MT systems. Human evaluation complements automatic metrics (BLEU, METEOR) by capturing nuances they miss.

Adequacy: Does the translation convey the same meaning as the source?

Fluency: Is the translation grammatical and natural-sounding in the target language?
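
The automatic metrics mentioned above score n-gram overlap with a reference translation. A minimal sketch of modified n-gram precision, the core of BLEU (omitting the brevity penalty and smoothing that full BLEU adds), applied to the example from Part 1 below:

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    """Modified n-gram precision: clipped n-gram matches / hypothesis n-grams."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    matches = sum(min(count, ref[g]) for g, count in hyp.items())
    total = sum(hyp.values())
    return matches / total if total else 0.0

# The missing article ("in garden") costs more at the bigram level
mt = "The dog has played in garden all day ."
ref = "The dog played in the garden all day ."
print(ngram_precision(mt, ref, n=1))
print(ngram_precision(mt, ref, n=2))
```

Note that a single dropped word can leave unigram precision high while noticeably lowering bigram precision, which is one reason BLEU averages over several n-gram orders.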

Common Error Types

| Category    | Examples                                      |
| ----------- | --------------------------------------------- |
| Accuracy    | Mistranslation, omission, addition of meaning |
| Fluency     | Grammar, spelling, punctuation errors         |
| Terminology | Wrong domain-specific terms                   |
| Style       | Register, formality mismatches                |

Part 1: Adequacy vs. Fluency

Source (German):

"Der Hund hat den ganzen Tag im Garten gespielt."

MT Output:

"The dog has played in garden all day."

Reference Translation:

"The dog played in the garden all day."

Question 1

Rate the MT output:

Adequacy (meaning preservation):

Fluency (naturalness):

Part 2: Error Annotation (MQM Style)

The Multidimensional Quality Metrics (MQM) framework annotates specific errors with severity.
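
In code, an MQM-style annotation is just a span, an error type, and a severity, and segment scores are usually severity-weighted penalties normalized by length. A minimal sketch — the severity weights and the per-100-words normalization are common choices for illustration, not the only MQM scoring scheme:

```python
from dataclasses import dataclass

# Illustrative severity weights; real MQM deployments tune these.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

@dataclass
class MQMError:
    span: str        # the erroneous text in the MT output
    error_type: str  # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str    # "minor", "major", or "critical"

def mqm_penalty(errors, n_words):
    """Severity-weighted penalty, normalized per 100 words of output."""
    total = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return 100.0 * total / n_words

# Hypothetical annotation of the Part 1 output (missing article)
errors = [MQMError("in garden", "fluency/grammar", "minor")]
print(mqm_penalty(errors, n_words=8))
```

Lower penalties are better; a segment with no annotated errors scores 0.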

Source (French):

"Le ministre a annoncé que le budget sera réduit de 15% l'année prochaine."

MT Output:

"The minister announced that the budget will be reduced by 15% next year."

Question 2

Identify any errors (there may be none):

| Error Span | Type | Severity |
| ---------- | ---- | -------- |
|            |      |          |

Part 3: Meaning Shifts

Source (Spanish):

"No me gusta nada este restaurante."

MT Output A:

"I don't like this restaurant at all."

MT Output B:

"I don't like anything about this restaurant."

Question 3

Both translations are grammatical. Which is more accurate?

Part 4: Domain-Specific Translation

Source (Medical, German):

"Der Patient zeigt Symptome einer akuten Gastritis."

MT Output:

"The patient shows symptoms of acute stomach inflammation."

Expected Medical Term:

"acute gastritis" (not "stomach inflammation")

Question 4

Is "stomach inflammation" an acceptable translation of "Gastritis"?

Part 5: Ranking Translations

Source (Japanese):

"Kare wa totemo isogashii desu."

System A:

"He is very busy."

System B:

"He's really swamped."

System C:

"He very busy is."

Question 5

Rank these translations from best (1) to worst (3):

| System   | Rank | Reason |
| -------- | ---- | ------ |
| System A |      |        |
| System B |      |        |
| System C |      |        |

Part 6: Reference-Free Evaluation

Source (Chinese):

(Assume you don't read Chinese)

MT Output:

"The company announced quarterly profits exceeded expectations by a wide margin, leading to a 5% surge in stock prices."

Question 6

Without knowing the source, can you still evaluate fluency?

Fluency rating:
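
Automatic reference-free fluency estimation typically relies on a target-language language model: output that the model finds more probable is, roughly, more fluent. A toy illustration with an add-alpha-smoothed word-bigram model — the tiny training corpus is a hypothetical stand-in, and real systems use large neural LMs:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigram contexts and bigrams over sentences with boundary markers."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return bigrams, unigrams

def log_prob(sentence, bigrams, unigrams, alpha=1.0):
    """Add-alpha smoothed log-probability; higher = more 'fluent' to this LM."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab = len(unigrams) + 1
    return sum(
        math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
        for a, b in zip(tokens[:-1], tokens[1:])
    )

corpus = ["he is very busy", "she is very happy", "he is at work"]
bg, ug = train_bigram_lm(corpus)
# Natural word order scores higher than the scrambled System C output from Part 5
print(log_prob("he is very busy", bg, ug) > log_prob("he very busy is", bg, ug))
```

The same idea does not transfer to adequacy: a language model can only tell you whether the output reads well, not whether it says what the source said.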

Part 7: Group Discussion

Question 7

Compare your evaluations with your group. Where did you disagree?
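
Disagreement like this is usually quantified with a chance-corrected agreement statistic. A minimal sketch of Cohen's kappa between two annotators (the ratings below are hypothetical; for ordinal scales, a weighted kappa that credits near-misses is often preferred):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical adequacy ratings (1-5) from two annotators on six segments
a = [5, 4, 4, 2, 3, 5]
b = [5, 4, 3, 2, 3, 4]
print(cohens_kappa(a, b))
```

Kappa of 1.0 means perfect agreement and 0.0 means agreement no better than chance; low kappa on a rating task usually signals that the guidelines need calibration, not that one annotator is wrong.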

Part 8: Reflection

Question 8

Why is MT evaluation difficult?

Key Takeaway

Translation quality is multi-dimensional and context-dependent.

  • Adequacy and fluency can conflict—which matters more depends on use case
  • Error annotation requires clear typologies and severity guidelines
  • Domain and register affect what counts as "correct"
  • Multiple valid translations mean human evaluation requires calibration