Worksheet: Machine Translation Quality Evaluation

Assessing accuracy, fluency, and adequacy in MT output
Course: Natural Language Annotation for Machine Learning
Task Type: Ordinal rating + Error annotation
Author: Jin Zhao

Background

Evaluating machine translation quality is essential for improving MT systems. Human evaluation complements automatic metrics (BLEU, METEOR) by capturing nuances they miss.

Adequacy: Does the translation convey the same meaning as the source?

Fluency: Is the translation grammatical and natural-sounding in the target language?
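
The automatic metrics mentioned above score n-gram overlap with a reference translation. A minimal sketch of modified n-gram precision, the core of BLEU (omitting the brevity penalty and smoothing that full BLEU adds), applied to the example from Part 1 below:

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    """Modified n-gram precision: clipped n-gram matches / hypothesis n-grams."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    matches = sum(min(count, ref[g]) for g, count in hyp.items())
    total = sum(hyp.values())
    return matches / total if total else 0.0

# The missing article ("in garden") costs more at the bigram level
mt = "The dog has played in garden all day ."
ref = "The dog played in the garden all day ."
print(ngram_precision(mt, ref, n=1))
print(ngram_precision(mt, ref, n=2))
```

Note that a single dropped word can leave unigram precision high while noticeably lowering bigram precision, which is one reason BLEU averages over several n-gram orders.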

Common Error Types

| Category    | Examples                                      |
| ----------- | --------------------------------------------- |
| Accuracy    | Mistranslation, omission, addition of meaning |
| Fluency     | Grammar, spelling, punctuation errors         |
| Terminology | Wrong domain-specific terms                   |
| Style       | Register, formality mismatches                |

Part 1: Adequacy vs. Fluency

Source (German):

"Der Hund hat den ganzen Tag im Garten gespielt."

MT Output:

"The dog has played in garden all day."

Reference Translation:

"The dog played in the garden all day."

Question 1

Rate the MT output:

Adequacy (meaning preservation):

Fluency (naturalness):

Part 2: Error Annotation (MQM Style)

The Multidimensional Quality Metrics (MQM) framework annotates specific errors with severity.
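
In code, an MQM-style annotation is just a span, an error type, and a severity, and segment scores are usually severity-weighted penalties normalized by length. A minimal sketch — the severity weights and the per-100-words normalization are common choices for illustration, not the only MQM scoring scheme:

```python
from dataclasses import dataclass

# Illustrative severity weights; real MQM deployments tune these.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

@dataclass
class MQMError:
    span: str        # the erroneous text in the MT output
    error_type: str  # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str    # "minor", "major", or "critical"

def mqm_penalty(errors, n_words):
    """Severity-weighted penalty, normalized per 100 words of output."""
    total = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return 100.0 * total / n_words

# Hypothetical annotation of the Part 1 output (missing article)
errors = [MQMError("in garden", "fluency/grammar", "minor")]
print(mqm_penalty(errors, n_words=8))
```

Lower penalties are better; a segment with no annotated errors scores 0.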

Source (French):

"Le ministre a annoncé que le budget sera réduit de 15% l'année prochaine."

MT Output:

"The minister announced that the budget will be reduced by 15% next year."

Question 2

Identify any errors (there may be none):

| Error Span | Type | Severity |
| ---------- | ---- | -------- |
|            |      |          |

Part 3: Meaning Shifts

Source (Spanish):

"No me gusta nada este restaurante."

MT Output A:

"I don't like this restaurant at all."

MT Output B:

"I don't like anything about this restaurant."

Question 3

Both translations are grammatical. Which is more accurate?

Part 4: Domain-Specific Translation

Source (Medical, German):

"Der Patient zeigt Symptome einer akuten Gastritis."

MT Output:

"The patient shows symptoms of acute stomach inflammation."

Expected Medical Term:

"acute gastritis" (not "stomach inflammation")

Question 4

Is "stomach inflammation" an acceptable translation of "Gastritis"?

Part 5: Ranking Translations

Source (Japanese):

"Kare wa totemo isogashii desu."

System A:

"He is very busy."

System B:

"He's really swamped."

System C:

"He very busy is."

Question 5

Rank these translations from best (1) to worst (3):

| System   | Rank | Reason |
| -------- | ---- | ------ |
| System A |      |        |
| System B |      |        |
| System C |      |        |

Part 6: Reference-Free Evaluation

Source (Chinese):

(Assume you don't read Chinese)

MT Output:

"The company announced quarterly profits exceeded expectations by a wide margin, leading to a 5% surge in stock prices."

Question 6

Without knowing the source, can you still evaluate fluency?

Fluency rating:
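
Automatic reference-free fluency estimation typically relies on a target-language language model: output that the model finds more probable is, roughly, more fluent. A toy illustration with an add-alpha-smoothed word-bigram model — the tiny training corpus is a hypothetical stand-in, and real systems use large neural LMs:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigram contexts and bigrams over sentences with boundary markers."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return bigrams, unigrams

def log_prob(sentence, bigrams, unigrams, alpha=1.0):
    """Add-alpha smoothed log-probability; higher = more 'fluent' to this LM."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab = len(unigrams) + 1
    return sum(
        math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
        for a, b in zip(tokens[:-1], tokens[1:])
    )

corpus = ["he is very busy", "she is very happy", "he is at work"]
bg, ug = train_bigram_lm(corpus)
# Natural word order scores higher than the scrambled System C output from Part 5
print(log_prob("he is very busy", bg, ug) > log_prob("he very busy is", bg, ug))
```

The same idea does not transfer to adequacy: a language model can only tell you whether the output reads well, not whether it says what the source said.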

Part 7: Group Discussion

Question 7

Compare your evaluations with your group. Where did you disagree?
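
Disagreement like this is usually quantified with a chance-corrected agreement statistic. A minimal sketch of Cohen's kappa between two annotators (the ratings below are hypothetical; for ordinal scales, a weighted kappa that credits near-misses is often preferred):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical adequacy ratings (1-5) from two annotators on six segments
a = [5, 4, 4, 2, 3, 5]
b = [5, 4, 3, 2, 3, 4]
print(cohens_kappa(a, b))
```

Kappa of 1.0 means perfect agreement and 0.0 means agreement no better than chance; low kappa on a rating task usually signals that the guidelines need calibration, not that one annotator is wrong.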

Part 8: Reflection

Question 8

Why is MT evaluation difficult?

Key Takeaway

Translation quality is multi-dimensional and context-dependent.

  • Adequacy and fluency can conflict—which matters more depends on use case
  • Error annotation requires clear typologies and severity guidelines
  • Domain and register affect what counts as "correct"
  • Multiple valid translations mean human evaluation requires calibration