N-gram model
An N-gram model is a statistical language model used in natural language processing (NLP) and computational linguistics. It predicts the next item in a sequence (such as a word or character) from the previous ‘n-1’ items. The ‘n’ in ‘N-gram’ is the number of items in each sequence the model considers. For example, a bigram model (n=2) looks at pairs of words, while a trigram model (n=3) looks at triplets of words.
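To make the difference between bigrams and trigrams concrete, here is a minimal sketch that slides a window of size n over a list of tokens. The function name ngrams, the whitespace tokenization, and the sample sentence are illustrative assumptions, not part of any particular library.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 windows of size n
    # (and none at all if it is shorter than n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, split on whitespace for simplicity.
tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)   # pairs of adjacent words
trigrams = ngrams(tokens, 3)  # triplets of adjacent words
```

With six tokens, this yields five bigrams and four trigrams; the first bigram is ("the", "cat").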
The N-gram model is based on the principle of conditional probability. It computes the probability of a word given the previous words in the sequence. This is expressed mathematically as:
P(w_n | w_1, w_2, …, w_{n-1})
where ‘w_n’ is the current word, and ‘w_1, w_2, …, w_{n-1}’ are the preceding words. The model is built by analyzing a large corpus of text to count occurrences of these N-grams and using these counts to estimate probabilities.
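One common way to turn counts into probabilities is the relative-frequency (maximum likelihood) estimate: the count of the full N-gram divided by the count of its history. The sketch below does this for the bigram case. The toy corpus and the names train_bigram and bigram_prob are assumptions made for illustration; a real model would be trained on a much larger corpus.

```python
from collections import Counter


def train_bigram(tokens):
    """Count each bigram and each one-word history that starts a bigram."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # The final token never starts a bigram, so it is excluded here.
    history_counts = Counter(tokens[:-1])
    return bigram_counts, history_counts


def bigram_prob(prev, word, bigram_counts, history_counts):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if history_counts[prev] == 0:
        return 0.0  # history never observed; no estimate is available
    return bigram_counts[(prev, word)] / history_counts[prev]


# Hypothetical toy corpus, split on whitespace.
corpus = "the cat sat on the mat the cat ran".split()
bigram_counts, history_counts = train_bigram(corpus)

p_cat_given_the = bigram_prob("the", "cat", bigram_counts, history_counts)
p_dog_given_the = bigram_prob("the", "dog", bigram_counts, history_counts)
```

In this corpus ‘the’ starts three bigrams, two of which are ‘the cat’, so the estimate for ‘cat’ after ‘the’ is 2/3. The pair ‘the dog’ never occurs, so its estimate is 0.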
N-gram models are widely used in applications such as text prediction, speech recognition, and machine translation. They are simple to implement and can perform reasonably well, especially when combined with techniques such as smoothing to handle unseen N-grams. However, they also have limitations: they cannot capture long-range dependencies (context beyond n-1 words), and the number of possible N-grams grows exponentially as ‘n’ increases, which leads to data sparsity.
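As an illustration of smoothing, the sketch below uses add-one (Laplace) smoothing, which is one of the simplest schemes and is chosen here only as an example; the text above does not prescribe a specific method. The function name laplace_bigram_prob and the toy corpus are assumptions.

```python
from collections import Counter


def laplace_bigram_prob(prev, word, bigram_counts, history_counts, vocab_size):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + vocab_size)


# Hypothetical toy corpus, split on whitespace.
corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])
vocab = set(corpus)  # V = number of distinct words in the corpus

# A bigram that was seen in the corpus.
p_seen = laplace_bigram_prob("the", "cat", bigram_counts, history_counts, len(vocab))
# A bigram that was never seen still receives a small nonzero probability.
p_unseen = laplace_bigram_prob("the", "sat", bigram_counts, history_counts, len(vocab))

# The smoothed estimates for a fixed history still sum to 1 over the vocabulary.
total = sum(
    laplace_bigram_prob("the", w, bigram_counts, history_counts, len(vocab))
    for w in vocab
)
```

Here the vocabulary has six words and ‘the’ occurs three times as a history, so the seen bigram ‘the cat’ gets (2 + 1) / (3 + 6) = 1/3 and the unseen bigram ‘the sat’ gets 1/9 instead of 0.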