AI Glossary: What Is Information Gain (IG)? Definition & Meaning

Information Gain is a key concept in information theory and machine learning that quantifies how effective an attribute is at classifying data. Specifically, it measures the reduction in entropy, or uncertainty, about a random variable once additional information is introduced.

Entropy, represented as H(X), is a measure of the unpredictability or disorder of a system. When we have a dataset with a target variable (e.g., whether an email is spam or not), the initial entropy reflects our uncertainty about the classification of that variable. By introducing a feature or attribute (such as the presence of certain words in the email), we can partition the dataset into subsets that provide more information about the target variable.
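To make the idea of entropy concrete, here is a minimal sketch in Python. It assumes the standard Shannon definition, H(X) = -sum over classes of p * log2(p), measured in bits. The function name entropy and the toy email labels are illustrative and not part of any particular library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(X), in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Each class contributes -p * log2(p), where p is its relative frequency.
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# Hypothetical target variable: is each email spam or not ("ham")?
mixed = ["spam", "spam", "ham", "ham"]   # maximally uncertain for two classes
pure = ["ham", "ham", "ham", "ham"]      # no uncertainty at all

print(entropy(mixed))
print(entropy(pure))
```

An evenly mixed set of two classes has an entropy of 1 bit, while a set containing a single class has an entropy of 0, which matches the intuition that there is nothing left to guess.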

The formula for Information Gain (IG) is as follows:

IG(X, Y) = H(X) – H(X|Y)

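Where:

H(X) is the entropy of the target variable X in the original dataset.
H(X|Y) is the conditional entropy of X given the attribute Y, that is, the entropy remaining within the subsets produced by splitting on Y, weighted by the size of each subset.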

In simpler terms, Information Gain tells us how much knowing the value of attribute Y reduces the uncertainty in predicting X. A high Information Gain indicates that the attribute is effective at splitting the data into groups that are more homogeneous with respect to the target variable.
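The formula IG(X, Y) = H(X) - H(X|Y) can be sketched in a few lines of Python. This is an illustrative implementation under the assumption that entropy is measured in bits and that H(X|Y) is computed as the size-weighted average entropy of the subsets defined by each value of Y. The function names and the toy email data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy H(X), in bits, of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(targets, attribute_values):
    """IG(X, Y) = H(X) - H(X|Y) for two equally long, paired sequences."""
    # Partition the target labels by the value of the attribute.
    groups = defaultdict(list)
    for x, y in zip(targets, attribute_values):
        groups[y].append(x)
    total = len(targets)
    # H(X|Y): entropy of each subset, weighted by the subset's share of the data.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(targets) - conditional


# Hypothetical toy data: six emails and two candidate attributes.
is_spam = ["spam", "spam", "spam", "ham", "ham", "ham"]
has_offer = [True, True, True, False, False, False]      # separates the classes perfectly
has_greeting = [True, False, True, False, True, False]   # barely related to the class

print(information_gain(is_spam, has_offer))     # 1.0: all uncertainty removed
print(information_gain(is_spam, has_greeting))  # small positive value, below 0.1
```

In this toy example the first attribute removes all uncertainty about the target, while the second leaves the subsets almost as mixed as the original data.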

This concept is widely used in decision tree algorithms such as ID3 (Iterative Dichotomiser 3), where each node splits on the attribute that provides the highest Information Gain, which tends to produce the most homogeneous subsets at every split.
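The attribute-selection step used at each node can be sketched as follows. This is not a full ID3 implementation (there is no recursion or stopping rule), only the choice of the split attribute for a single node. It assumes rows are stored as dictionaries, and the names best_attribute, has_offer, has_link and label are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest Information Gain, as an ID3-style node would."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy dataset of four emails.
rows = [
    {"has_offer": True,  "has_link": True,  "label": "spam"},
    {"has_offer": True,  "has_link": False, "label": "spam"},
    {"has_offer": False, "has_link": True,  "label": "ham"},
    {"has_offer": False, "has_link": False, "label": "ham"},
]

print(best_attribute(rows, ["has_offer", "has_link"], "label"))  # has_offer
```

A full tree builder would apply the same selection again to each resulting subset, using only the attributes not yet chosen on that path, until the subsets are pure or no attributes remain.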

In summary, Information Gain is a fundamental measure in data science that helps us identify which features or attributes are most informative for predicting outcomes.