A Hybrid Approach to Critical Error Detection

This report visualizes an experiment on the WMT21 Critical Error Detection task. We explore a novel hybrid system that combines the COMETKiwi-23 XL quality estimation model with a TinyLlama-1.1B verifier to identify high-impact translation errors across four language pairs.

The Challenge

The WMT21 task is to classify machine translations as either containing a "critical error" (label 1) or not (label 0). A key challenge is the severe class imbalance, with far fewer critical errors than acceptable translations.
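To make the imbalance concrete, the toy check below (illustrative numbers, not the actual WMT21 counts) shows that a trivial baseline which never flags a critical error can still look strong on accuracy while carrying no detection signal; a balance-aware metric such as the Matthews Correlation Coefficient (MCC), used in the results below, exposes this.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Toy, heavily imbalanced labels (9 acceptable : 1 critical error) -- illustrative only,
# not the real WMT21 distribution.
y_true = [0] * 9 + [1]
y_trivial = [0] * 10  # a baseline that never predicts a critical error

print(accuracy_score(y_true, y_trivial))     # 0.9 -- looks strong
print(matthews_corrcoef(y_true, y_trivial))  # 0.0 -- no detection ability
```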

Dataset Composition

The experiment uses 4,000 samples from the WMT21 development set, split evenly across four language pairs. This chart shows the label distribution within each pair.
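The tabulation behind this distribution can be reproduced with a few lines of pandas. This is a minimal sketch assuming the development set has been exported to a tab-separated file with lang_pair and label columns; the file name and column names are illustrative, not taken from the task release.

```python
import pandas as pd

# Hypothetical export of the WMT21 CED development data; adjust the path and
# column names to match your local copy.
df = pd.read_csv("wmt21_ced_dev.tsv", sep="\t")  # expects: lang_pair, source, mt, label

# Counts of label 0 (no critical error) vs. label 1 (critical error) per language pair.
counts = (
    df.groupby(["lang_pair", "label"])
      .size()
      .unstack(fill_value=0)
      .rename(columns={0: "no_error", 1: "critical_error"})
)
print(counts)                                           # absolute counts per pair
print(counts.div(counts.sum(axis=1), axis=0).round(3))  # the same table as proportions
```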

Hybrid System Architecture

Our method generates two distinct signals for each translation and fuses them for a final prediction. This flowchart illustrates the process from input to classification.

Input: each example consists of a source sentence and its machine translation (MT), which are passed to two scorers. COMETKiwi-23 XL generates a continuous quality score, while the TinyLlama verifier generates a binary "critical error" flag (Yes/No).
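The sketch below shows one way the two signals could be produced. It assumes the unbabel-comet package with the Unbabel/wmt23-cometkiwi-da-xl checkpoint (our reading of "COMETKiwi-23 XL") and the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model loaded through transformers; the prompt wording is our own assumption, not the report's exact prompt.

```python
import torch
from comet import download_model, load_from_checkpoint
from transformers import AutoTokenizer, AutoModelForCausalLM

# Signal 1: continuous quality score (checkpoint name is an assumption).
qe_model = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xl"))

def quality_score(src: str, mt: str) -> float:
    # COMETKiwi is reference-free: it scores the MT output from the source alone.
    return qe_model.predict([{"src": src, "mt": mt}], batch_size=1, gpus=0).scores[0]

# Signal 2: binary critical-error flag from the TinyLlama verifier.
llm_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(llm_id)
verifier = AutoModelForCausalLM.from_pretrained(llm_id)

def critical_error_flag(src: str, mt: str) -> int:
    # Ask a yes/no question (illustrative prompt) and map the answer to 0/1.
    messages = [{
        "role": "user",
        "content": (f"Source: {src}\nTranslation: {mt}\n"
                    "Does the translation contain a critical error? Answer Yes or No."),
    }]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = verifier.generate(
            **inputs, max_new_tokens=3, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    answer = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return 1 if answer.strip().lower().startswith("yes") else 0
```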

Feature fusion and classification: the two signals are concatenated into the feature vector [score, flag], and a logistic regression classifier maps it to the final prediction (0 or 1).
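A minimal sketch of the fusion and classification step follows, with illustrative feature values standing in for real scorer outputs; the class_weight="balanced" setting is our assumption for coping with the label imbalance, not something the report specifies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative values only -- in the experiment these come from the two scorers above.
comet_scores = np.array([0.84, 0.31, 0.76, 0.12, 0.66, 0.22])  # continuous QE scores
llm_flags    = np.array([0,    1,    0,    1,    0,    1])     # TinyLlama Yes/No flags
labels       = np.array([0,    1,    0,    1,    0,    1])     # gold: 1 = critical error

# Feature fusion: each translation becomes the two-dimensional vector [score, flag].
X = np.column_stack([comet_scores, llm_flags])

# Logistic regression maps the fused features to the final 0/1 decision.
clf = LogisticRegression(class_weight="balanced").fit(X, labels)
print(clf.predict(X))        # hard predictions (0 or 1)
print(clf.predict_proba(X))  # per-class probabilities, if a soft score is needed
```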

Experiment Results

Overall Performance

The hybrid system reaches a Matthews Correlation Coefficient (MCC) of 0.282 across all four language pairs.


This chart compares the model's performance on each language pair, revealing how linguistic differences can impact accuracy.
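The per-pair breakdown behind the chart can be computed by evaluating MCC within each language pair. Below is a short sketch assuming a DataFrame with lang_pair, label, and pred columns; the column names, pair identifiers, and values are placeholders, not results from the report.

```python
import pandas as pd
from sklearn.metrics import matthews_corrcoef

# Placeholder rows -- in practice, `label` holds the gold labels and `pred` the
# hybrid system's decisions for every development sample.
results = pd.DataFrame({
    "lang_pair": ["lp1", "lp1", "lp1", "lp2", "lp2", "lp2"],
    "label":     [0, 1, 0, 0, 1, 1],
    "pred":      [0, 1, 0, 0, 0, 1],
})

# One MCC value per language pair, as visualized in the chart.
per_pair_mcc = {
    pair: matthews_corrcoef(group["label"], group["pred"])
    for pair, group in results.groupby("lang_pair")
}
print(per_pair_mcc)
```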