AI Glossary: What Is Model Poisoning (MP)? Definition & Meaning

Model poisoning is a type of adversarial attack on machine learning systems in which an attacker deliberately corrupts the training process, most often by manipulating the training data used to build a model (a variant also known as data poisoning). The manipulation causes the model to learn incorrect patterns or make biased predictions, undermining its reliability and effectiveness. The attacker typically injects crafted data points into the dataset that are designed to mislead the model during the training phase.

In practice, model poisoning is a particular concern in collaborative learning environments where multiple participants contribute to a shared model. In federated learning, for instance, multiple devices train a model collectively without sharing their raw data. An attacker who controls one participant can alter its local dataset, and therefore the update it contributes, to degrade the shared model’s performance.
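The federated scenario can be illustrated with a minimal sketch. It assumes the server combines client updates with a plain unweighted average; the function name `federated_average` and all numbers are hypothetical and chosen only for illustration, not taken from any particular framework.

```python
from typing import List


def federated_average(updates: List[List[float]]) -> List[float]:
    """Average client updates coordinate by coordinate (unweighted)."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


# Hypothetical two-parameter updates from four honest clients.
honest = [
    [0.10, -0.20],
    [0.12, -0.18],
    [0.08, -0.22],
    [0.10, -0.20],
]

# Hypothetical update from one attacker-controlled client,
# pointing in the opposite direction with a much larger magnitude.
malicious = [-5.0, 5.0]

clean_global = federated_average(honest)
poisoned_global = federated_average(honest + [malicious])

print("without attacker:", clean_global)
print("with attacker:   ", poisoned_global)
```

Under these assumptions a single participant out of five is enough to flip the sign of both averaged parameters, because a plain mean gives every update equal influence regardless of how extreme it is.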

Attackers can use several techniques in a model poisoning attack. They might inject data that misrepresents the true data distribution, create outliers that skew what the model learns, or introduce specific examples that push the model toward incorrect predictions on critical tasks. The impact ranges from subtle performance degradation to catastrophic failure once the model is deployed in real-world applications.
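As a rough illustration of injecting mislabeled examples, the sketch below uses a deliberately simple one-dimensional classifier whose decision threshold is the midpoint between the two class means. The classifier, the helper names `fit_threshold` and `predict`, and the data are all assumptions made for this example.

```python
from statistics import mean


def fit_threshold(samples):
    """Fit a toy 1-D classifier: threshold at the midpoint of the class means."""
    class0 = [x for x, label in samples if label == 0]
    class1 = [x for x, label in samples if label == 1]
    return (mean(class0) + mean(class1)) / 2


def predict(threshold, x):
    """Predict class 1 for values above the threshold, else class 0."""
    return 1 if x > threshold else 0


# Hypothetical clean training set: low values are class 0, high values class 1.
clean = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]

# Hypothetical poison: high-valued points deliberately mislabeled as class 0.
poison = [(10.0, 0), (10.0, 0), (10.0, 0)]

clean_threshold = fit_threshold(clean)
poisoned_threshold = fit_threshold(clean + poison)

target = 6.5  # an input the attacker wants misclassified
print("clean model:   ", predict(clean_threshold, target))
print("poisoned model:", predict(poisoned_threshold, target))
```

In this toy setup, three injected points move the threshold from 5.0 to 7.0, so the target input changes from class 1 to class 0 while most clean points are still classified as before.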

To defend against model poisoning, researchers and practitioners combine several strategies: anomaly detection to identify suspicious data, robust learning algorithms that are less sensitive to outliers, and regular audits of the training data to verify its integrity. Understanding model poisoning is crucial for developing resilient AI systems that maintain their performance and ethical standards in the face of potential attacks.
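Two of these defenses can be sketched in a few lines. The example below assumes contributions arrive as small numeric vectors; it uses a coordinate-wise median as one possible robust aggregation rule and a modified z-score based on the median absolute deviation (MAD) as one possible anomaly detector. The function names, the data, and the 3.5 cutoff are illustrative assumptions, not a prescribed method.

```python
from math import hypot
from statistics import median


def median_aggregate(updates):
    """Robust aggregation: take the median of each coordinate."""
    return [median(column) for column in zip(*updates)]


def flag_outliers(values, cutoff=3.5):
    """Return indices whose MAD-based modified z-score exceeds the cutoff."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All typical values are identical; flag anything that differs.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > cutoff
    ]


# Hypothetical contributions: four honest, one extreme (index 4).
updates = [
    [0.10, -0.20],
    [0.12, -0.18],
    [0.08, -0.22],
    [0.10, -0.20],
    [-5.0, 5.0],
]

robust_global = median_aggregate(updates)
norms = [hypot(*u) for u in updates]
suspicious = flag_outliers(norms)

print("median aggregate:", robust_global)
print("flagged indices: ", suspicious)
```

With this data the median stays at the honest values and only the extreme contribution is flagged. Defenses of this kind target conspicuous outliers; poisoned inputs crafted to resemble normal ones may pass both checks, which is one reason audits of the training data remain part of the picture.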