AI Glossary: What Is Inverse Document Frequency (IDF)? Definition & Meaning

Inverse Document Frequency (IDF) is a statistical measure used in information retrieval and natural language processing to evaluate the importance of a term in a document relative to a collection of documents, or corpus. It is often combined with Term Frequency (TF) to form the TF-IDF score, which indicates how relevant a word is to a specific document given how often it occurs across the entire dataset.

IDF is calculated by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of that quotient. The formula is:

IDF(t) = log(N / df(t))

Where:

IDF(t) is the inverse document frequency of term t.
N is the total number of documents.
df(t) is the number of documents containing the term t.
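
As a minimal sketch of the formula above, the Python snippet below computes IDF over a tiny hand-made corpus. The function name inverse_document_frequency, the toy corpus, and the choice of the natural logarithm are assumptions made for illustration; the formula itself does not fix a log base, and real systems often add smoothing to handle terms that appear in no document.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df(t)) for `term` over a list of tokenized documents."""
    n_docs = len(documents)  # N: total number of documents
    # df(t): number of documents that contain the term at least once
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        # The plain formula is undefined here (division by zero).
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / doc_freq)  # natural log, chosen as an assumption

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
    ["the", "stock", "market", "fell"],
]

idf_the = inverse_document_frequency("the", corpus)      # log(4 / 4)
idf_cat = inverse_document_frequency("cat", corpus)      # log(4 / 2)
idf_stock = inverse_document_frequency("stock", corpus)  # log(4 / 1)
```

In this toy corpus "the" appears in every document, so its IDF is log(1) = 0, while "stock" appears in only one document and receives the largest value of the three.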

The importance of IDF lies in its ability to reduce the weight of terms that occur very frequently across documents, as these terms provide less discriminative power for identifying relevant documents. A term that appears in every document has an IDF of log(1) = 0. In contrast, terms that are rare across the corpus will have a higher IDF score, indicating they are more valuable for distinguishing documents.

For example, common words like ‘the’, ‘is’, and ‘and’ would have a low IDF score, while more distinctive terms related to a specific topic would score higher, highlighting their relevance. This makes IDF a crucial component in various applications, including search engines, document clustering, and text classification, where understanding the significance of terms in context is essential.
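
To illustrate how common and topic-specific words separate once IDF is combined with TF, here is a hedged sketch of a TF-IDF score. The helper tf_idf_scores, the sample documents, the natural logarithm, and the definition of TF as count divided by document length are all assumptions for this example; several other TF and IDF variants are in common use.

```python
import math
from collections import Counter

def tf_idf_scores(doc_index, documents):
    """Score every term of one document as TF * log(N / df(t))."""
    n_docs = len(documents)
    doc = documents[doc_index]
    counts = Counter(doc)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc)  # assumed TF variant: relative frequency
        # df >= 1 here because the term occurs in `doc` itself
        df = sum(1 for d in documents if term in d)
        scores[term] = tf * math.log(n_docs / df)
    return scores

# Hypothetical toy documents, tokenized by whitespace.
docs = [
    "the cat is on the mat".split(),
    "the dog is in the yard".split(),
    "the neural network is trained".split(),
]

scores = tf_idf_scores(2, docs)
# 'the' and 'is' occur in all three documents, so their score is 0;
# 'neural' occurs only in the third, so it scores 0.2 * log(3).
```

Under these assumptions the words shared by every document drop out entirely, and the ranking of the third document's terms is driven by "neural", "network", and "trained".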