Evaluating machine translation quality is essential for improving MT systems. Human evaluation complements automatic metrics (BLEU, METEOR) by capturing nuances they miss.
Adequacy: Does the translation convey the same meaning as the source?
Fluency: Is the translation grammatical and natural-sounding in the target language?
| Category | Typical errors |
|---|---|
| Accuracy | Mistranslation, omission, addition of meaning |
| Fluency | Grammar, spelling, punctuation errors |
| Terminology | Wrong domain-specific terms |
| Style | Register, formality mismatches |
"Der Hund hat den ganzen Tag im Garten gespielt."
"The dog has played in garden all day."
"The dog played in the garden all day."
Rate the MT output:
Adequacy (meaning preservation):
Fluency (naturalness):
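Once several people have rated the same output, the scores can be summarized per dimension. The sketch below is only an illustration: the 1-5 scale, the three ratings per dimension, and the `summarize` helper are all assumptions, not part of the worksheet.

```python
from statistics import mean

# Hypothetical ratings from three annotators on an assumed 1-5 scale.
ratings = {
    "adequacy": [4, 5, 4],
    "fluency": [3, 3, 2],
}

def summarize(scores):
    """Return (mean rounded to 2 places, max-min spread) for one dimension."""
    return round(mean(scores), 2), max(scores) - min(scores)

summary = {dimension: summarize(scores) for dimension, scores in ratings.items()}

for dimension, (average, spread) in summary.items():
    print(f"{dimension}: mean={average}, spread={spread}")
```

A large spread on one dimension is a hint that the annotators understood the scale differently and is worth discussing before averaging.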
In the Multidimensional Quality Metrics (MQM) framework, annotators mark specific error spans and assign each one an error type and a severity level.
"Le ministre a annonce que le budget sera reduit de 15% l'annee prochaine."
"The minister announced that the budget will be reduced by 15% next year."
Identify any errors (there may be none):
| Error Span | Type | Severity |
|---|---|---|
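Annotations in this (span, type, severity) form are often turned into a single penalty by weighting each error by its severity and normalizing by length. The following is a minimal sketch under stated assumptions: the weights (minor 1, major 5, critical 10) are one possible choice and vary between projects, and the sample annotation is a made-up one for the earlier output "The dog has played in garden all day."

```python
# Assumed severity weights; real MQM scorecards may use different values.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_penalty(annotations, word_count):
    """Sum severity weights over all annotated errors, per word of MT output."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total = sum(SEVERITY_WEIGHTS[severity] for _span, _type, severity in annotations)
    return total / word_count

# Hypothetical annotation: one minor fluency error (missing article).
annotations = [("in garden", "Fluency", "minor")]
penalty = mqm_penalty(annotations, word_count=8)

# An output with no annotated errors gets a penalty of zero.
clean_penalty = mqm_penalty([], word_count=13)
```

Lower is better here. Normalizing by word count keeps long segments from looking worse simply because they offer more room for errors.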
"No me gusta nada este restaurante."
"I don't like this restaurant at all."
"I don't like anything about this restaurant."
Both translations are grammatical. Which is more accurate?
"Der Patient zeigt Symptome einer akuten Gastritis."
"The patient shows symptoms of acute stomach inflammation."
"acute gastritis" (not "stomach inflammation")
Is "stomach inflammation" an acceptable translation of "Gastritis"?
"Kare wa totemo isogashii desu."
"He is very busy."
"He's really swamped."
"He very busy is."
Rank these translations from best (1) to worst (3):
| System | Rank | Reason |
|---|---|---|
| System A | | |
| System B | | |
| System C | | |
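When several annotators fill in a ranking table like this one, a simple way to combine them is the mean rank per system. The ranks below are invented purely for illustration and are not an answer key; `mean_ranks` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical rankings from three annotators (1 = best, 3 = worst).
rankings = [
    {"System A": 1, "System B": 2, "System C": 3},
    {"System A": 2, "System B": 1, "System C": 3},
    {"System A": 1, "System B": 2, "System C": 3},
]

def mean_ranks(rankings):
    """Average each system's rank across annotators."""
    systems = rankings[0].keys()
    return {system: mean(r[system] for r in rankings) for system in systems}

averaged = mean_ranks(rankings)

# Order systems from best (lowest mean rank) to worst.
ordered = sorted(averaged, key=averaged.get)
print(ordered)
```

Mean rank hides how far apart the systems are, so the "Reason" column still matters: two annotators can agree on an order for very different reasons.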
(Assume you don't read Chinese)
"The company announced quarterly profits exceeded expectations by a wide margin, leading to a 5% surge in stock prices."
Without knowing the source, can you still evaluate fluency?
Fluency rating:
Compare your evaluations with your group. Where did you disagree?
Why is MT evaluation difficult?
Translation quality is multi-dimensional and context-dependent.
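Disagreement within a group can be quantified rather than just discussed. One common measure for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below implements it by hand; the two rating lists are hypothetical, and treating 1-5 ratings as unordered categories is a simplifying assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical fluency ratings for six segments on an assumed 1-5 scale.
annotator_1 = [5, 4, 3, 5, 2, 4]
annotator_2 = [5, 4, 4, 5, 2, 3]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on four of six segments, yet kappa comes out at about 0.54, noticeably lower than the raw agreement of 0.67, because some of that agreement would be expected by chance.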