AI Glossary: What Is the BLEU Score Metric? Definition & Meaning

The BLEU score (Bilingual Evaluation Understudy) is a popular evaluation metric used in the field of natural language processing (NLP) to assess the quality of text produced by machine translation systems and other text generation models. Developed in the early 2000s, BLEU measures how closely the output of a model aligns with one or more reference texts, typically human-generated translations or summaries.

BLEU compares n-grams (contiguous sequences of n items, usually words) in the generated text with those in the reference texts. At its core is n-gram precision: the number of n-grams in the generated text that also appear in a reference, divided by the total number of n-grams in the generated text. Counts are clipped so that an n-gram is not credited more often than it occurs in a reference, and the precisions for several n-gram sizes (commonly 1 through 4) are combined with a geometric mean. BLEU also applies a brevity penalty to discourage short translations that might achieve high precision but fail to convey the full meaning of the source text.
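The calculation described above can be sketched in a few lines of Python. This is a minimal, unsmoothed sentence-level illustration, not a reference implementation: the function names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) are hypothetical, tokenization is assumed to be a simple whitespace split, and the closest-reference-length rule for the brevity penalty is one common convention.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Return 1.0 if the candidate is longer than the closest reference,
    otherwise exp(1 - r / c). Ties in closeness go to the shorter reference
    (an assumption of this sketch)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Geometric mean of the 1..max_n modified precisions, multiplied by
    the brevity penalty. No smoothing: any zero precision gives 0.0."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), [reference]))
print(bleu("the cat".split(), [reference], max_n=2))
```

In the second call every n-gram of the candidate appears in the reference, so precision alone would be perfect; only the brevity penalty lowers the score. Production libraries usually add smoothing and work at corpus level, so their numbers can differ from this sketch.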

The metric yields a score between 0 and 1, where a score of 1 indicates a perfect match with the reference texts. However, the BLEU score has some limitations: it focuses primarily on precision and can overlook important contextual or semantic differences. It can also be sensitive to the length of the output text, which is why the brevity penalty was introduced.
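A small illustration of the semantic blind spot, using clipped unigram precision only (a deliberately simplified stand-in for full BLEU; the helper name `unigram_precision` and the example sentences are made up for this sketch):

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped word-overlap precision between two whitespace-tokenized strings."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / total


reference = "the cat sat on the mat"
scrambled = "mat the on sat cat the"         # same words, meaningless order
paraphrase = "the feline rested on the rug"  # same meaning, different words

print(unigram_precision(scrambled, reference))   # 1.0
print(unigram_precision(paraphrase, reference))  # 0.5
```

The scrambled sentence scores higher than the faithful paraphrase because only surface word overlap is counted. Full BLEU reduces the first problem by also matching longer n-grams, but it still cannot credit synonyms unless they appear in a reference.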

Despite its drawbacks, BLEU remains widely used because it provides a straightforward, quantitative way to evaluate and compare machine-generated text against human-written references. It has been instrumental in benchmarking various NLP systems and continues to evolve with the advancement of AI technologies.