AI Glossary: What Is an N-Gram Language Model? Definition & Meaning

An N-Gram Language Model is a statistical language model used in natural language processing and computational linguistics to estimate the probability of a sequence of words. It is built on n-grams, which are contiguous sequences of ‘n’ items (typically words) from a given text or speech sample. The simplest form is the unigram model, which treats each word in isolation; bigram and trigram models treat sequences of two and three words, respectively, as the unit.
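To make the definition concrete, the following is a minimal Python sketch that extracts n-grams from a list of tokens. The function name `ngrams` and the whitespace tokenization are illustrative assumptions, not part of any particular library.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous sequence of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A window of size n starting at i exists for i = 0 .. len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the dog chased the cat".split()
unigrams = ngrams(tokens, 1)   # single words
bigrams = ngrams(tokens, 2)    # pairs of adjacent words
trigrams = ngrams(tokens, 3)   # triples of adjacent words
```

A text shorter than n tokens simply yields an empty list, since no full window fits.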

In practice, an N-Gram Language Model estimates the probability of a word given the previous ‘n-1’ words. For example, in a bigram model, the probability of the word ‘dog’ following ‘the’ is estimated as the number of times the sequence ‘the dog’ occurs in the training data divided by the number of times ‘the’ occurs. This allows the model to capture local context and dependencies in the language, which is useful for tasks like speech recognition, machine translation, and text generation.
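The bigram estimate described above can be sketched in a few lines of Python. This is a simplified illustration using raw relative frequencies; the function name `bigram_probability` and the toy corpus are assumptions made for the example.

```python
from collections import Counter


def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    # Count each pair of adjacent tokens.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token as a context; the final token has no successor.
    context_counts = Counter(tokens[:-1])
    if context_counts[prev_word] == 0:
        return 0.0  # the context never appears, so there is no estimate
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]


# Hypothetical toy corpus: 'the' occurs 3 times, 'the dog' occurs twice.
corpus = "the dog barked and the cat ran and the dog slept".split()
p_dog_given_the = bigram_probability(corpus, "the", "dog")  # 2 / 3
```

Note that any word pair absent from the corpus receives probability zero under this estimate, which is the sparsity problem that smoothing is meant to address.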

However, N-Gram models have limitations: they cannot capture dependencies that span more than ‘n-1’ words, and as ‘n’ or the vocabulary grows, most possible n-grams never appear in the training data (data sparsity). To address sparsity, techniques like smoothing, back-off methods, and the use of larger corpora are often employed. Despite their simplicity, N-Gram models form the foundation for more complex language models, including neural network-based approaches.
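As one illustration of smoothing, the sketch below applies add-one (Laplace) smoothing to a bigram model. It is among the simplest smoothing methods and is chosen here only for clarity; the function name and toy corpus are assumptions made for the example.

```python
from collections import Counter


def laplace_bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) with add-one (Laplace) smoothing."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    vocab_size = len(set(tokens))
    # Add 1 to every bigram count; adding vocab_size to the denominator
    # keeps the probabilities for each context summing to 1.
    return (bigram_counts[(prev_word, word)] + 1) / (context_counts[prev_word] + vocab_size)


# Hypothetical toy corpus with a vocabulary of 7 distinct words.
corpus = "the dog barked and the cat ran and the dog slept".split()
p_seen = laplace_bigram_probability(corpus, "the", "dog")      # (2 + 1) / (3 + 7)
p_unseen = laplace_bigram_probability(corpus, "the", "slept")  # (0 + 1) / (3 + 7)
```

The unseen pair ‘the slept’ now receives a small non-zero probability instead of zero, at the cost of taking some probability mass away from pairs that were actually observed.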