TF-IDF

TF-IDF is a statistical measure used to evaluate how important a word is to a document within a collection of documents.

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure that assesses the importance of a word within a document relative to a set of documents, often referred to as a corpus. It is widely used in information retrieval, natural language processing, and text mining.

The measure has two components: term frequency (TF) and inverse document frequency (IDF). Term frequency measures how often a term appears in a specific document. The intuition is that the more often a word appears in a document, the more relevant it is to that document's content. Mathematically, it is expressed as:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
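
The TF formula can be sketched in a few lines of Python. This is a minimal illustration, assuming a document is already split into a list of lowercase tokens; the function name `term_frequency` and the sample sentence are hypothetical, not part of any library.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Occurrences of `term` divided by the total number of tokens."""
    if not document_tokens:
        # Assumption: an empty document gives a TF of 0 rather than an error.
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


# Hypothetical sample document, tokenized by whitespace.
tokens = "the cat sat on the mat".split()
tf_the = term_frequency("the", tokens)  # "the" is 2 of the 6 tokens
tf_dog = term_frequency("dog", tokens)  # "dog" does not occur
```

Real text usually needs more careful tokenization (punctuation, casing, stemming) than the whitespace split used here.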

On the other hand, Inverse Document Frequency quantifies how much information a word provides, based on how common or rare it is across all documents. Words that are very common across many documents (like ‘the’ or ‘and’) are less informative. IDF is calculated as:

IDF(t, D) = log(total number of documents in corpus D / number of documents containing term t)
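
A small sketch of the IDF formula follows. It assumes the natural logarithm (the formula above does not fix a base) and returns 0 for a term that appears in no document, where the ratio would otherwise be undefined. The function name and the toy corpus are hypothetical.

```python
import math


def inverse_document_frequency(term, corpus):
    """log(number of documents / number of documents containing `term`)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # Assumption: treat an unseen term as carrying no score instead of
        # dividing by zero. Libraries often apply smoothing here instead.
        return 0.0
    return math.log(len(corpus) / n_containing)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
idf_the = inverse_document_frequency("the", corpus)    # in all 3 documents
idf_cat = inverse_document_frequency("cat", corpus)    # in 2 of 3 documents
idf_bird = inverse_document_frequency("bird", corpus)  # in 1 of 3 documents
```

A term present in every document, such as "the" here, gets an IDF of log(1) = 0, which is how the measure discounts very common words.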

Combining these two components, the TF-IDF score is calculated as:

TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
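
The product can be sketched as a single function. This is an illustrative, unsmoothed version under the same assumptions as the formulas above (natural logarithm, whitespace-tokenized documents); the name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, document_tokens, corpus):
    """TF(t, d) * IDF(t, D) for one term, one document, and a corpus."""
    if not document_tokens:
        return 0.0
    tf = Counter(document_tokens)[term] / len(document_tokens)
    n_containing = sum(1 for doc in corpus if term in doc)
    # Assumption: a term found in no document scores 0.
    idf = math.log(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
score_the = tf_idf("the", corpus[0], corpus)  # frequent but in every document
score_mat = tf_idf("mat", corpus[0], corpus)  # rarer across the corpus
```

In this toy corpus "the" has the highest term frequency in the first document but scores 0 because it appears in every document, while "mat" scores higher because it appears in only one.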

This score highlights keywords that are both relevant to a specific document and not overly common in the broader corpus, which makes it a useful tool for text analysis, search engines, and recommendation systems. For example, documents in which a particular term has a high TF-IDF score are likely to be more relevant to queries involving that term.
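
As a rough illustration of the query use case, the sketch below ranks documents by the sum of the TF-IDF scores of the query terms. Summing per-term scores is one simple scoring choice assumed here for illustration; production search engines typically use more elaborate schemes. All names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, document_tokens, corpus):
    """Unsmoothed TF-IDF of one term in one tokenized document."""
    if not document_tokens:
        return 0.0
    tf = Counter(document_tokens)[term] / len(document_tokens)
    n_containing = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf


def rank_documents(query_terms, corpus):
    """Return document indices ordered from highest to lowest total score."""
    scores = [
        sum(tf_idf(term, doc, corpus) for term in query_terms)
        for doc in corpus
    ]
    # sorted() is stable, so documents with equal scores keep corpus order.
    return sorted(range(len(corpus)), key=lambda i: -scores[i])


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
ranking = rank_documents(["cat"], corpus)
```

Here the second document ranks first for the query "cat": both of the first two documents contain the word once, but the second is shorter, so its term frequency is higher.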
