
Inverse Document Frequency

IDF

Inverse document frequency (IDF) measures how much information a word provides, based on how rare it is across a set of documents.

Inverse Document Frequency (IDF) is a statistical measure used in information retrieval and natural language processing to evaluate the importance of a term relative to a collection of documents, or corpus. It is often combined with Term Frequency (TF) to produce the TF-IDF score, which indicates how relevant a word is to a specific document given how often it occurs across the entire collection.

IDF is calculated by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of that quotient. The formula is:

IDF(t) = log(N / df(t))

Where:

  • IDF(t) is the inverse document frequency of term t.
  • N is the total number of documents.
  • df(t) is the number of documents containing the term t.
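The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the toy corpus, and the choice of the natural logarithm are assumptions (the formula does not specify a base), and raising an error when no document contains the term is just one way to handle a case the formula leaves undefined.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df(t)) for `term`, where `documents` is a list of token sets."""
    n_docs = len(documents)  # N: total number of documents
    # df(t): number of documents that contain the term at least once
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        # Assumption: the plain formula is undefined here (division by zero)
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(n_docs / doc_freq)  # natural log, chosen for illustration

# Hypothetical four-document corpus, each document reduced to its set of words
corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
    {"the", "quantum", "computer"},
]

idf_the = inverse_document_frequency("the", corpus)          # log(4 / 4)
idf_cat = inverse_document_frequency("cat", corpus)          # log(4 / 2)
idf_quantum = inverse_document_frequency("quantum", corpus)  # log(4 / 1)
```

In this toy corpus, "the" appears in every document and scores 0, while "quantum" appears in only one document and receives the highest score.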

The importance of IDF lies in its ability to reduce the weight of terms that occur very frequently across documents, since these terms provide less discriminative power for identifying relevant documents. A term that appears in every document has N / df(t) = 1, so its IDF is log(1) = 0. In contrast, terms that are rare across the corpus receive a higher IDF score, indicating that they are more valuable for distinguishing documents.
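To illustrate this down-weighting effect, the sketch below multiplies a term's count in one document by its IDF to get a TF-IDF weight. Using the raw count as the TF component is an assumption made for simplicity, since several TF variants exist; the function name and the three-sentence corpus are likewise hypothetical, and the document being scored is assumed to belong to the corpus so that df(t) is never zero.

```python
import math
from collections import Counter

def tf_idf(document_tokens, corpus_tokens):
    """Weight each term in one document by raw count times log(N / df(t))."""
    n_docs = len(corpus_tokens)
    term_counts = Counter(document_tokens)  # raw term frequency (assumed TF variant)
    weights = {}
    for term, count in term_counts.items():
        # df(t): number of corpus documents that contain the term
        doc_freq = sum(1 for doc in corpus_tokens if term in doc)
        weights[term] = count * math.log(n_docs / doc_freq)
    return weights

# Hypothetical corpus; each document is a list of whitespace-separated tokens
corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the ball".split(),
    "the bird sang".split(),
]

weights = tf_idf(corpus[0], corpus)
```

Although "the" occurs twice in the first document, it appears in all three documents, so its weight is 0. The words "cat" and "mouse" each occur once but in only one document, so they end up with the largest weights.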

For example, common words like ‘the’, ‘is’, and ‘and’ would have a low IDF score, while more distinctive terms related to a specific topic would score higher, highlighting their relevance. This makes IDF a crucial component in various applications, including search engines, document clustering, and text classification, where understanding the significance of terms in context is essential.
