Active learning
Active learning is a subfield of machine learning in which a model selects the data it learns from, rather than passively receiving all available data. This approach is particularly useful when labeled data is scarce or expensive to obtain.
In traditional supervised machine learning, models are trained on a dataset that has been labeled in advance. In active learning, by contrast, the model identifies the data points it expects to be most informative and requests labels for those specific instances. This lets the model concentrate on the examples that contribute most to learning, so it can often reach a given accuracy with fewer labeled examples.
Active learning generally follows an iterative process. Initially, a small subset of the data is labeled and used to train the model. The model then assesses the remaining unlabeled data and selects the instances it is most uncertain about, or that it predicts will benefit its learning most. These selected instances are labeled by an oracle (often a human expert) and added to the training set. The model is retrained on the enlarged training set, and the cycle repeats until a desired performance level is reached or the labeling budget is exhausted.
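The cycle described above can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not a reference implementation: the model is a toy one-dimensional threshold classifier, the oracle is noise-free, the seed set contains at least one example of each class, and the names `fit_threshold` and `active_learning_loop` are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple


def fit_threshold(labeled: Dict[float, int]) -> float:
    """Toy 1-D model (an assumption for this sketch): place the decision
    boundary halfway between the largest point labeled 0 and the smallest
    point labeled 1. Requires at least one example of each class."""
    negatives = [x for x, y in labeled.items() if y == 0]
    positives = [x for x, y in labeled.items() if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def active_learning_loop(
    pool: Sequence[float],
    oracle: Callable[[float], int],
    seed_points: Sequence[float],
    budget: int,
) -> Tuple[float, Dict[float, int]]:
    """Pool-based loop: train, pick the most uncertain point, ask the oracle, retrain."""
    # Step 1: label a small seed set and train the initial model.
    labeled = {x: oracle(x) for x in seed_points}
    unlabeled = [x for x in pool if x not in labeled]
    threshold = fit_threshold(labeled)

    for _ in range(budget):  # stop when the labeling budget is spent
        if not unlabeled:
            break
        # Step 2: the point closest to the boundary is the one this toy
        # model is least certain about.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        # Step 3: the oracle supplies the label; the point joins the training set.
        labeled[query] = oracle(query)
        # Step 4: retrain on the enlarged training set.
        threshold = fit_threshold(labeled)

    return threshold, labeled
```

For example, with `pool = [i / 100 for i in range(101)]`, an oracle `lambda x: int(x >= 0.63)`, seed points `[0.0, 1.0]` and a budget of 12, the learned threshold should settle between 0.62 and 0.63 after labeling only 14 of the 101 points. A real system would replace the threshold model with an actual classifier and would usually add a stopping test on validation performance.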
Common strategies used in active learning include:
- Uncertainty sampling: Selecting the instances for which the model is least confident in its predictions.
- Query by committee: Training multiple models and selecting the instances on which their predictions disagree most.
- Expected model change: Choosing the instances that would change the model most if their labels were known.
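The first two strategies reduce to scoring each unlabeled instance and querying the one with the highest score. The sketch below shows commonly used scores: least confidence and entropy for uncertainty sampling, and vote entropy for query by committee. The function names are hypothetical, and the inputs (per-class probabilities, or one predicted label per committee member) are assumed to come from models trained elsewhere. Expected model change is left out because it depends on the specifics of the model being trained.

```python
import math
from collections import Counter
from typing import Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty sampling: 1 minus the probability of the predicted class."""
    return 1.0 - max(probs)


def entropy(probs: Sequence[float]) -> float:
    """Uncertainty sampling: Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_entropy(votes: Sequence[int]) -> float:
    """Query by committee: entropy of the labels voted by the committee members."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def select_most_informative(scores: Sequence[float]) -> int:
    """Return the index of the instance with the highest informativeness score."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

Given predicted probabilities `[[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]`, both `least_confidence` and `entropy` rank the second instance highest, since its prediction is closest to a coin flip. Given committee votes `[[1, 1, 1], [0, 1, 1], [0, 0, 0]]`, `vote_entropy` also singles out the second instance, the only one on which the committee disagrees.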
Active learning is widely used in fields such as natural language processing, computer vision, and medical diagnostics, where acquiring labeled data can be costly or time-consuming. By choosing which data to learn from, active learning aims to improve model performance while reducing the amount of labeled data required.