AI Glossary: What Is an N-gram Model? Definition & Meaning

N-gram Model

An N-gram model is a statistical language model used in natural language processing (NLP) and computational linguistics. It predicts the next item in a sequence (such as a word or character) from the previous ‘n-1’ items. An ‘N-gram’ is a contiguous sequence of ‘n’ items, so ‘n’ sets how much context the model sees. For example, a bigram model (n=2) looks at pairs of words and predicts each word from the one before it, while a trigram model (n=3) looks at triplets and predicts each word from the two before it.
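To make the idea of pairs and triplets concrete, here is a minimal Python sketch that slides a window of size n over a list of tokens. The helper name `ngrams`, the whitespace tokenization, and the sample sentence are illustrative assumptions, not part of any particular library.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # A sequence of length L contains L - n + 1 windows (none if L < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the cat sat on the mat".split()

bigrams = ngrams(tokens, 2)   # pairs of adjacent words
trigrams = ngrams(tokens, 3)  # triplets of adjacent words

print(bigrams)
print(trigrams)
```

Real systems usually apply a proper tokenizer and add sentence-boundary markers before extracting N-grams; plain `split()` is used here only to keep the sketch short.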

The N-gram model operates on the principle of conditional probability. It computes the probability of a word given the previous words in the sequence. This is expressed mathematically as:

P(w_n | w_1, w_2, …, w_{n-1})

where ‘w_n’ is the current word, and ‘w_1, w_2, …, w_{n-1}’ are the n-1 words immediately preceding it. The model is built by counting how often each N-gram occurs in a large corpus of text. Each probability is then estimated as the count of the full N-gram divided by the count of its first n-1 words.
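The counting step can be sketched for the bigram case (n=2) as follows. This is a minimal illustration under stated assumptions: the function name `train_bigram_mle`, the `<s>` and `</s>` boundary markers, and the three-sentence toy corpus are all hypothetical choices made for the example.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Estimate P(word | prev) by relative frequency (no smoothing)."""
    bigram_counts = Counter()   # how often (prev, word) occurs
    context_counts = Counter()  # how often prev occurs as a context
    for sentence in sentences:
        # Assumed boundary markers so the first and last words have context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1

    def prob(word, prev):
        # Unseen contexts get probability 0.0 instead of dividing by zero.
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Hypothetical toy corpus.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
prob = train_bigram_mle(corpus)

print(prob("cat", "the"))  # "the" is followed by "cat" in 2 of 3 cases
print(prob("sat", "cat"))  # "cat" is followed by "sat" in 1 of 2 cases
print(prob("dog", "cat"))  # this bigram never occurs in the corpus
```

Note that any bigram absent from the corpus receives probability zero under this estimate, which is the problem smoothing is meant to address.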

N-gram models are widely used in various applications, including text prediction, speech recognition, and machine translation. They are simple to implement and can provide reasonable performance, especially when combined with techniques like smoothing to handle unseen N-grams. However, they also have limitations. They cannot capture long-range dependencies (context beyond n-1 words), and the number of possible N-grams grows exponentially as ‘n’ increases, so most of them never appear in the training corpus (data sparsity).
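As one illustration of smoothing, the sketch below applies add-one (Laplace) smoothing to a bigram model: every count is incremented by one, and the vocabulary size is added to the denominator so that the probabilities for each context still sum to one. Add-one is chosen here only because it is the simplest scheme to show; the function name `train_bigram_add_one`, the boundary markers, and the toy corpus are assumptions made for the example.

```python
from collections import Counter


def train_bigram_add_one(sentences):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    bigram_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sentence in sentences:
        # Assumed boundary markers, as in a typical bigram setup.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Words that can be predicted: includes </s>, excludes <s>.
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1

    vocab_size = len(vocab)

    def prob(word, prev):
        # Add 1 to every bigram count and the vocabulary size to the total.
        return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

    return prob, vocab


# Hypothetical toy corpus.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
prob, vocab = train_bigram_add_one(corpus)

print(prob("cat", "the"))  # seen bigram: (2 + 1) / (3 + 6)
print(prob("dog", "cat"))  # unseen bigram: (0 + 1) / (2 + 6), no longer zero
print(sum(prob(w, "the") for w in vocab))  # probabilities over the vocabulary
```

Add-one smoothing tends to shift too much probability mass to unseen N-grams when the vocabulary is large, which is why more refined methods such as Kneser-Ney smoothing are generally preferred in practice.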