AI Glossary: What Is Imbalanced Data? Definition & Meaning

Imbalanced data refers to a situation in machine learning and data science where the classes or categories within a dataset are not represented equally. This often occurs in classification tasks where one class is far more frequent than the others. For instance, in a dataset used for fraud detection, there may be thousands of legitimate transactions for every instance of fraud. This imbalance can lead to biased predictions: machine learning models tend to favor the majority class, which results in poor performance on the minority class.

When trained on imbalanced datasets, traditional algorithms may achieve high accuracy simply by predicting the majority class most of the time, but this says little about how well they identify the minority class. Consequently, accuracy can be a misleading metric. Practitioners instead often rely on precision, recall, and the F1-score, which give a better picture of model performance on both classes.
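To make the accuracy pitfall concrete, the following minimal sketch computes accuracy, precision, recall, and F1 for a hypothetical model that always predicts the majority class. The 990-to-10 class split, the function names, and the label encoding (1 for fraud, 0 for legitimate) are illustrative assumptions, not real data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up dataset: 990 legitimate transactions (0) and 10 frauds (1).
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_pred_majority = [0] * 1000

metrics = classification_metrics(y_true, y_pred_majority)
print(metrics)
```

On this made-up dataset the always-legitimate model reaches 99% accuracy, yet its recall and F1 on the fraud class are both zero, because it never flags a single fraud.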

Several techniques can be used to handle imbalanced data, including:

Resampling Methods: These involve either oversampling the minority class or undersampling the majority class to obtain a more balanced dataset.
Algorithmic Approaches: Some algorithms are specifically designed to account for class imbalance, such as cost-sensitive learning methods that assign different weights to classes based on their frequency.
Data Augmentation: This technique generates synthetic instances of the minority class to increase its representation.
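Two of the techniques above can be sketched with the Python standard library alone. The first function performs random oversampling by duplicating minority-class rows until every class matches the largest one; it copies existing rows and does not synthesize new ones, so it illustrates resampling rather than data augmentation. The second derives cost-sensitive class weights from inverse class frequency, using the common heuristic n_samples / (n_classes * class_count). The function names and the 95-to-5 toy dataset are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # Rows of this class in the original data, used as the draw pool.
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Toy dataset (hypothetical): 95 majority rows and 5 minority rows.
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

balanced_X, balanced_y = random_oversample(X, y)
weights = inverse_frequency_weights(y)

print(Counter(balanced_y))
print(weights)
```

Oversampling changes the training data itself, while class weights leave the data untouched and instead change how much each error costs during training. Which one fits better depends on the model and on how much duplicated minority data it can tolerate before overfitting.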

Overall, addressing imbalanced data is crucial for developing robust and reliable machine learning models, particularly in fields like healthcare, fraud detection, and risk management, where the consequences of misclassification can be significant.