Inverse document frequency (IDF) is a statistical measure used in information retrieval and natural language processing to evaluate the importance of a term in a document relative to a collection of documents, or corpus. It is often combined with term frequency (TF) to form the TF-IDF score, which helps determine how relevant a word is to a specific document compared with how often it occurs across the entire corpus.
IDF is calculated by dividing the total number of documents by the number of documents containing the term, then taking the logarithm of that quotient. The formula is:
IDF(t) = log(N / df(t))
Where:
- IDF(t) is the inverse document frequency of term t.
- N is the total number of documents.
- df(t) is the number of documents containing the term t.
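The formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `inverse_document_frequency`, the representation of documents as lists of tokens, the use of the natural logarithm, and the decision to raise an error when df(t) is zero are all assumptions made for this example, since the text does not fix any of them.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log(N / df(t)) for `term` over a list of tokenized documents."""
    # N: total number of documents in the corpus
    n_docs = len(documents)
    # df(t): number of documents that contain the term at least once
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        # The formula is undefined when no document contains the term.
        # Raising an error is one possible choice (an assumption of this sketch).
        raise ValueError(f"term {term!r} does not occur in any document")
    # Natural logarithm is assumed here; the base only rescales the scores.
    return math.log(n_docs / df)
```

A term that appears in every document gets log(N / N) = 0, and the score grows as the term appears in fewer documents.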
IDF matters because it reduces the weight of terms that occur very frequently across documents, since these terms provide less discriminative power for identifying relevant documents. In contrast, terms that are rare across the corpus receive a higher IDF score, indicating they are more valuable for distinguishing documents.
For example, common words like ‘the’, ‘is’, and ‘and’ would have a low IDF score, while more distinctive terms related to a specific topic would score higher, highlighting their relevance. This makes IDF a crucial component in various applications, including search engines, document clustering, and text classification, where understanding the significance of terms in context is essential.
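The contrast between common and topic-specific words can be seen on a toy corpus. The three sentences below are invented for illustration, tokenization is a plain whitespace split, and the natural logarithm is assumed; a real system would typically use a larger corpus and more careful preprocessing.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, made up for this example.
corpus = [
    "the cat is on the mat",
    "the cat is in the garden",
    "the telescope is aimed at the galaxy",
]

# Naive whitespace tokenization (an assumption of this sketch).
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: for each term, how many documents contain it.
# Using set(tokens) counts each term at most once per document.
document_frequency = Counter()
for tokens in tokenized:
    document_frequency.update(set(tokens))

# IDF(t) = log(N / df(t)) for every term seen in the corpus.
idf_scores = {
    term: math.log(n_docs / df) for term, df in document_frequency.items()
}

for term in ("the", "is", "cat", "telescope"):
    print(term, round(idf_scores[term], 3))
```

Here ‘the’ and ‘is’ occur in all three documents and score 0, ‘cat’ occurs in two and scores about 0.405, and ‘telescope’ occurs in only one and scores about 1.099.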