AI Glossary: What Is an Imbalanced Dataset (ID)? Definition & Meaning

Imbalanced Dataset

An imbalanced dataset occurs when the distribution of classes in a dataset is not uniform, meaning that some classes are represented significantly more than others. This is a common issue in machine learning and can lead to biased models that perform well on the majority class but poorly on the minority class.

For instance, in a medical diagnosis application, if 95% of the data points represent healthy patients and only 5% represent patients with a rare disease, the model may learn to simply predict ‘healthy’ most of the time to achieve high accuracy. This can result in the model failing to accurately identify cases of the rare disease, which can have serious real-world implications.

Imbalanced datasets can arise in various domains, including fraud detection, disease classification, and customer churn prediction, among others. When the classes are imbalanced, traditional performance metrics like accuracy can be misleading. For example, a model that predicts the majority class for all instances can still achieve high accuracy while failing to detect any instances of the minority class.
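
The gap between accuracy and minority-class detection can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 95% / 5% split from the medical example, uses made-up labels (0 for healthy, 1 for the rare disease), and stands in for a real model with a predictor that always outputs the majority class.

```python
# Toy labels mirroring a 95% / 5% split (assumed for illustration):
# 0 = healthy, 1 = rare disease.
y_true = [0] * 95 + [1] * 5

# A stand-in "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of actual disease cases detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy        = {accuracy:.2f}")  # 0.95
print(f"minority recall = {recall:.2f}")    # 0.00
```

Accuracy looks strong at 0.95, yet recall on the minority class is 0.00: not a single disease case is found. This is why per-class metrics such as recall are usually reported alongside accuracy for imbalanced problems.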

To address the challenges posed by imbalanced datasets, several techniques can be employed:

Resampling Methods: These include oversampling the minority class (adding more instances) or undersampling the majority class (removing instances) to create a more balanced dataset.
Algorithmic Adjustments: Some machine learning algorithms can be modified to give more weight to the minority class during training, helping the model learn to recognize it better.
Ensemble Techniques: Techniques like bagging and boosting can combine multiple models to improve prediction of the minority class.
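
The first two techniques can be sketched in plain Python. This is a minimal illustration, not a production implementation: the helper names `random_oversample` and `balanced_class_weights` are hypothetical, the data is a placeholder, and the weighting uses a common inverse-frequency heuristic (total samples divided by the number of classes times the class count). Ensemble methods are not covered here.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items of smaller classes until
    every class has as many items as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Placeholder data with a 95 / 5 split (assumed for illustration).
labels = [0] * 95 + [1] * 5
samples = list(range(100))

resampled_samples, resampled_labels = random_oversample(samples, labels)
print(Counter(resampled_labels))  # both classes now have 95 items

weights = balanced_class_weights(labels)
print(weights)  # the minority class receives the larger weight
```

Random oversampling only repeats existing minority examples, so it adds no new information and can encourage overfitting; class weights leave the data unchanged and instead scale each class's contribution to the training loss. Which option works better depends on the model and the data.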

Understanding and addressing imbalanced datasets is essential for developing robust machine learning models that perform well across all classes.