
Inverse Document Frequency

IDF

Inverse Document Frequency (IDF) measures how much information a word provides, based on how rare it is across documents.

Inverse Document Frequency (IDF) is a statistical measure used in information retrieval and natural language processing to evaluate the importance of a term relative to a collection of documents, or corpus. It is often combined with Term Frequency (TF) to form the TF-IDF score, which indicates how relevant a word is within a specific document compared with its occurrence across the entire corpus.

IDF is calculated by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of that quotient. The formula is:

IDF(t) = log(N / df(t))

Where:

  • IDF(t) is the inverse document frequency of term t.
  • N is the total number of documents.
  • df(t) is the number of documents containing the term t.
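
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `idf`, the toy `docs` corpus, and the choice of the natural logarithm are all assumptions (the formula does not fix a base), and the sketch raises an error for terms that appear in no document, since N / df(t) is undefined there.

```python
import math

def idf(term, documents):
    """Return log(N / df(t)) for `term` over a list of token sets.

    Assumption: natural logarithm; the formula leaves the base open.
    """
    n_docs = len(documents)                               # N
    df = sum(1 for doc in documents if term in doc)       # df(t)
    if df == 0:
        # N / df(t) is undefined when no document contains the term.
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(n_docs / df)

# Hypothetical toy corpus: each document is a set of its distinct tokens.
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
    {"the", "stock", "market", "fell"},
]

for term in ("the", "cat", "stock"):
    print(term, round(idf(term, docs), 3))
```

Here "the" occurs in all four documents, so its IDF is log(4/4) = 0, while "stock" occurs in only one and gets log(4/1), the highest value of the three.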

The significance of IDF lies in its ability to reduce the weight of terms that occur very frequently across documents, since these terms provide little discriminative power for identifying relevant documents. In contrast, terms that are rare across the corpus receive a higher IDF score, indicating that they are more valuable for distinguishing documents.

For example, common words like ‘the’, ‘is’, and ‘and’ would have a low IDF score (a word that appears in every document scores log(1) = 0), while more distinctive terms related to a specific topic would score higher, highlighting their relevance. This makes IDF a crucial component in various applications, including search engines, document clustering, and text classification, where understanding the significance of terms in context is essential.
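
To illustrate how IDF combines with term frequency, the sketch below computes a TF-IDF score per term and document. It rests on assumptions the entry does not specify: TF is taken as the raw count of a term in a document, the logarithm is natural, and the names `tf_idf` and `corpus` are hypothetical. Real systems often use normalized or smoothed variants.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumptions: TF is the raw term count, IDF is log(N / df(t))
    with the natural logarithm, and no smoothing is applied.
    """
    n_docs = len(documents)
    # df[t] = number of documents that contain term t at least once.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    scores = []
    for tokens in documents:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Hypothetical toy corpus of pre-tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the market fell and the index dropped".split(),
]

scores = tf_idf(corpus)
# Print the first document's terms, highest score first.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:4s} {score:.3f}")
```

Although "the" appears twice in the first document, it occurs in every document, so its IDF of 0 cancels the term frequency. Terms found only in that document, such as "mat", end up with the highest scores.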
