TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a statistical measure that assesses the importance of a word within a document relative to a set of documents, often referred to as a corpus. It is widely used in information retrieval, natural language processing, and text mining.
The measure consists of two components: Term Frequency (TF) and Inverse Document Frequency (IDF). Term Frequency measures how frequently a term appears in a specific document. The intuition is that the more times a word appears in a document, the more relevant it is to the content of that document. Mathematically, it is expressed as:
TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)
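As a minimal sketch of the TF formula, the following Python snippet counts a term in a tokenized document and divides by the document length. The function name term_frequency and the whitespace tokenization are assumptions made for illustration; real pipelines usually also normalize case and punctuation.

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Relative frequency of `term` in one tokenized document."""
    if not document_tokens:
        return 0.0  # assumption: an empty document yields TF = 0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Hypothetical example document, split on whitespace.
tokens = "the cat sat on the mat".split()

tf_the = term_frequency("the", tokens)  # 2 occurrences out of 6 tokens
tf_cat = term_frequency("cat", tokens)  # 1 occurrence out of 6 tokens
tf_dog = term_frequency("dog", tokens)  # term absent from the document
```

Note that a very common word such as "the" gets the highest TF here, which is why TF alone is not a good indicator of importance.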
Inverse Document Frequency, by contrast, quantifies how much information a word provides, based on how common or rare it is across all documents. Words that appear in most documents (like ‘the’ or ‘and’) are less informative and receive a low IDF. IDF is calculated as:
IDF(t, D) = log (total number of documents in D / number of documents containing term t)
Combining these two components, the TF-IDF score is calculated as:
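The IDF formula can be sketched in Python as follows. The text does not specify the logarithm base, so the natural logarithm is assumed here. The function name inverse_document_frequency, the toy corpus, and the choice to return 0 for a term that appears in no document (to avoid division by zero) are also assumptions.

```python
import math

def inverse_document_frequency(term, corpus):
    """log(N / df), where df is the number of documents containing `term`."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms score 0 instead of dividing by zero
    return math.log(len(corpus) / containing)  # natural log (assumed base)

# Hypothetical corpus of three tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

idf_the = inverse_document_frequency("the", corpus)    # log(3/3): in every document
idf_cat = inverse_document_frequency("cat", corpus)    # log(3/2): in two documents
idf_bird = inverse_document_frequency("bird", corpus)  # log(3/1): in one document
```

The rarer the term, the larger its IDF. A term that occurs in every document gets an IDF of exactly zero.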
TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
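To make the product concrete, here is a small self-contained sketch that multiplies the two components for every distinct term in one document. The helper name tf_idf, the toy corpus, and the natural logarithm are assumptions. Some libraries apply smoothed or otherwise modified variants of these formulas, so their numbers may differ.

```python
import math
from collections import Counter

def tf_idf(term, document_tokens, corpus):
    """TF(t, d) * IDF(t, D) using the plain, unsmoothed definitions."""
    if not document_tokens:
        return 0.0
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # assumption: terms absent from the corpus score 0
    tf = Counter(document_tokens)[term] / len(document_tokens)
    idf = math.log(len(corpus) / df)  # natural log (assumed base)
    return tf * idf

# Hypothetical corpus of three tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

first_doc = corpus[0]
# Score each distinct term of the first document.
scores = {term: tf_idf(term, first_doc, corpus) for term in set(first_doc)}
# "the": (2/6) * log(3/3)  -> zero, because it occurs in every document
# "cat": (1/6) * log(3/2)
# "mat": (1/6) * log(3/1)  -> highest, because it is unique to this document
top_term_score = max(scores.values())
```

Although "the" has the highest raw frequency in the first document, its score drops to zero, while terms unique to that document rise to the top.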
This score highlights keywords that are both relevant to a specific document and not overly common in the broader corpus, which makes it a useful tool for text analysis, search engines, and recommendation systems. For example, documents with a high TF-IDF score for a particular term are likely to be more relevant to queries involving that term.
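The query-relevance idea can be sketched as a toy ranking function. Summing the TF-IDF scores of the query terms per document is an assumed, deliberately simple scoring scheme, not the method of any particular search engine. The names rank_documents and tf_idf and the corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(term, tokens, corpus):
    """Plain TF * IDF with a natural logarithm (assumed base)."""
    if not tokens:
        return 0.0
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    tf = Counter(tokens)[term] / len(tokens)
    return tf * math.log(len(corpus) / df)

def rank_documents(query_terms, corpus):
    """Return (document index, score) pairs, highest score first."""
    scores = [
        (index, sum(tf_idf(term, tokens, corpus) for term in query_terms))
        for index, tokens in enumerate(corpus)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical corpus of three tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

ranking = rank_documents(["cat"], corpus)
best_index = ranking[0][0]  # index of the document scoring highest for "cat"
```

In this toy corpus the second document ranks first for the query "cat", because the term makes up a larger share of its tokens. The third document, which does not contain the term, scores zero.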