AI Glossary: What Is Information Gain (IG)? Definition & Meaning

Information Gain is a key concept in information theory and machine learning that quantifies how effective an attribute is at classifying data. Specifically, it measures the reduction in entropy, or uncertainty, about a random variable when additional information is introduced.

Entropy, represented as H(X), is a measure of the unpredictability or disorder of a system. When we have a dataset with a target variable (e.g., whether an email is spam or not), the initial entropy reflects our uncertainty about the classification of that variable. By introducing a feature or attribute (such as the presence of certain words in the email), we can partition the dataset into subsets that provide more information about the target variable.
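
As a rough illustration of the entropy described above, the following Python sketch computes the Shannon entropy of a list of class labels. The function name entropy and the spam/ham labels are hypothetical, and the base-2 logarithm (entropy measured in bits) is an assumption, since the text does not fix a base.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels (hypothetical helper)."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen here: an empty set carries no uncertainty
    counts = Counter(labels)
    # H(X) = sum over classes of p * log2(1 / p), with p the class frequency
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# An evenly mixed set is maximally uncertain for two classes
mixed = entropy(["spam", "ham", "spam", "ham"])
# A set with a single class has no uncertainty at all
pure = entropy(["spam", "spam", "spam", "spam"])
print(mixed, pure)
```

A two-class dataset split evenly gives the maximum of 1 bit, while a dataset containing only one class gives 0.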

The formula for Information Gain (IG) is:

IG(X, Y) = H(X) – H(X|Y)

Where:

H(X) is the entropy of the original dataset.
H(X|Y) is the conditional entropy of the dataset given the attribute Y.

In simpler terms, Information Gain tells us how much knowing the value of attribute Y reduces the uncertainty in predicting X. A high Information Gain indicates that the attribute is effective at splitting the data into groups that are more homogeneous with respect to the target variable.
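
The formula IG(X, Y) = H(X) – H(X|Y) can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the toy email labels, and the two example attributes are all hypothetical, and entropy is again assumed to be measured in bits. The conditional entropy H(X|Y) is computed as the size-weighted average of the entropies of the subsets produced by each value of Y.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """IG(X, Y) = H(X) - H(X|Y) for target labels X and attribute values Y."""
    total = len(labels)
    if total == 0:
        return 0.0
    # Partition the target labels by the value the attribute takes
    groups = defaultdict(list)
    for label, value in zip(labels, attribute_values):
        groups[value].append(label)
    # H(X|Y): entropy of each subset, weighted by the subset's share of the data
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = ["spam", "spam", "ham", "ham"]
contains_offer = [True, True, False, False]  # separates the classes perfectly
sent_on_monday = [True, False, True, False]  # unrelated to the class

ig_offer = information_gain(labels, contains_offer)
ig_monday = information_gain(labels, sent_on_monday)
print(ig_offer, ig_monday)
```

In this toy data, the first attribute removes all uncertainty about the label (a gain of 1 bit), while the second leaves each subset as mixed as the original data (a gain of 0).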

This concept is widely used in decision tree algorithms such as ID3 (Iterative Dichotomiser 3), where each node splits on the attribute that provides the highest Information Gain, so that every split removes as much uncertainty about the target variable as possible.
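
The attribute-selection step used at a single decision tree node can be sketched as below. This shows only the choice of the splitting attribute, not a full ID3 implementation; the names best_attribute, has_offer, has_link, and label, as well as the four example rows, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its conditional entropy given the attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest Information Gain for this node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training rows for a spam classifier
emails = [
    {"has_offer": True, "has_link": True, "label": "spam"},
    {"has_offer": True, "has_link": False, "label": "spam"},
    {"has_offer": False, "has_link": True, "label": "ham"},
    {"has_offer": False, "has_link": False, "label": "ham"},
]

chosen = best_attribute(emails, ["has_offer", "has_link"], "label")
print(chosen)
```

A complete tree builder would apply the same selection recursively to each resulting subset until the subsets are pure or no attributes remain.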

In summary, Information Gain is a fundamental measure in data science that helps us identify which features or attributes are most informative for predicting outcomes.