
Inverse Document Frequency

IDF

Inverse Document Frequency (IDF) measures how informative a word is, based on how rare that word is across the whole document collection.

Inverse Document Frequency (IDF) is a statistical measure used in information retrieval and natural language processing to evaluate the importance of a term in a document relative to a collection of documents, or corpus. It is often combined with Term Frequency (TF) to produce the TF-IDF score, which indicates how relevant a word is to a specific document compared with how often it occurs across the entire dataset.

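IDF is calculated by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of that quotient. The formula is: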

IDF(t) = log(N / df(t))

Where:

  • IDF(t) is the inverse document frequency of term t.
  • N is the total number of documents.
  • df(t) is the number of documents containing the term t.
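
The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `document_frequencies` and `idf`, the toy corpus, and the choice of the natural logarithm are assumptions made here (the entry does not specify a log base), and documents are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Return df(t): the number of documents that contain each term."""
    df = Counter()
    for tokens in docs:
        # set() so a term is counted once per document, not once per occurrence
        df.update(set(tokens))
    return df


def idf(term, docs):
    """IDF(t) = log(N / df(t)), using the natural logarithm (an assumption)."""
    n_docs = len(docs)
    df_t = document_frequencies(docs)[term]
    if df_t == 0:
        # The plain formula is undefined for terms that appear in no document.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / df_t)


# Hypothetical four-document corpus, already split into tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
    "the market is open".split(),
]

idf_the = idf("the", corpus)  # in 4 of 4 documents: log(4 / 4)
idf_cat = idf("cat", corpus)  # in 2 of 4 documents: log(4 / 2)
idf_dog = idf("dog", corpus)  # in 1 of 4 documents: log(4 / 1)
```

A term found in every document gets an IDF of zero, and the score grows as the term appears in fewer documents. Practical systems often use smoothed variants of the formula to avoid division by zero; those are outside what this entry describes.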

The significance of IDF lies in its ability to measure how informative a term is from its rarity across the corpus. It reduces the weight of terms that occur very frequently across documents, as these terms provide less discriminative power for identifying relevant documents. In contrast, terms that are rare across the corpus will have a higher IDF score, indicating they are more valuable for distinguishing documents.

For example, common words like ‘the’, ‘is’, and ‘and’ would have a low IDF score, while more distinctive terms related to a specific topic would score higher, highlighting their relevance. This makes IDF a crucial component in various applications, including search engines, document clustering, and text classification, where understanding the significance of terms in context is essential.
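
To make the contrast between common and distinctive words concrete, the sketch below combines IDF with term frequency into a TF-IDF score. It rests on stated assumptions: TF is taken as the term's count divided by the document length (one common convention, not something this entry defines), the logarithm is natural, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumes TF = count / document length and IDF = log(N / df(t)).
    """
    n_docs = len(docs)

    # df(t): number of documents containing each term
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical four-document corpus, already split into tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
    "the market is open".split(),
]

scores = tf_idf(corpus)
first_doc = scores[0]

# Rank the terms of the first document from most to least distinctive.
ranked = sorted(first_doc, key=first_doc.get, reverse=True)
```

In the first document, ‘the’ occurs twice but appears in every document, so its IDF, and therefore its TF-IDF score, is zero. ‘mat’ occurs once and appears in no other document, so it outranks ‘cat’, which is shared with a second document.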
