AI Glossary: What Is Gini Impurity? Definition & Meaning

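Gini Impurity is a measure of how mixed the class labels in a dataset are. Equivalently, it is the probability that a randomly chosen instance would be mislabeled if it were assigned a label at random according to the class distribution of that dataset. It is commonly used in machine learning, particularly in the construction of decision trees, to determine how well a split separates classes in classification tasks. The Gini Impurity is calculated using the formula: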

Gini = 1 – ∑(p_i)²

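where p_i is the proportion of instances belonging to class i, and the sum runs over all K classes. The value of Gini Impurity ranges from 0 to 1 – 1/K, where:

0 indicates a perfectly pure dataset (all instances belong to a single class), and
1 – 1/K indicates maximum impurity (instances are evenly distributed across the K classes). For two classes the maximum is 0.5; the value approaches 1 as the number of classes grows but never reaches it.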
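
As a minimal sketch, the formula can be written in a few lines of Python. The function name gini_impurity and the choice to treat an empty dataset as pure are assumptions made for illustration, not part of any particular library.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions found in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty dataset is treated as pure
    counts = Counter(labels)  # number of instances per class
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (one class only)
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (two classes, evenly split)
print(gini_impurity(["a", "b", "c"]))       # about 0.667 (three classes, evenly split)
```

The last two calls show that an even split across two classes gives 0.5 and an even split across three gives roughly 0.667, so the maximum depends on the number of classes.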

In practice, Gini Impurity is evaluated for each candidate split. For a given split, the impurity of every child node is computed and then averaged, with each child weighted by the fraction of the parent's instances it receives. The split with the lowest weighted impurity is chosen, because it produces child nodes that are more homogeneous than the parent node. This measure is favored for its computational efficiency (unlike entropy, it requires no logarithms) and because minimizing it steers each split toward child nodes dominated by a single class.
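
The split search can be sketched for a single numeric feature as follows. Each candidate is scored by the average impurity of its two child nodes, weighted by child size. The helper names weighted_gini and best_threshold, and the choice of midpoints between sorted distinct values as candidate thresholds, are illustrative assumptions; real decision tree libraries differ in how they enumerate candidates.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions found in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Average the child impurities, weighting each child by its share of instances."""
    n = len(left) + len(right)  # callers must pass at least one instance in total
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, weighted impurity)."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = weighted_gini(left, right)
        if score < best[1]:  # keep the split with the lowest weighted impurity
            best = (threshold, score)
    return best

values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
print(best_threshold(values, labels))  # (2.5, 0.0)
```

In this toy data, splitting at 2.5 sends both "a" instances to one child and both "b" instances to the other, so the weighted impurity drops to 0. The other two candidate thresholds leave one child mixed and score about 0.333.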

Overall, Gini Impurity is a core splitting criterion in decision tree algorithms such as CART. By steering each split toward purer child nodes, it helps the tree separate classes effectively.