Out-of-sample prediction is a central concept in machine learning and statistics. It refers to evaluating a model's performance on data that was not used during the training phase. This shows how well the model generalizes to new, unseen data, and helps confirm that the model is learning to identify the underlying patterns rather than merely memorizing the training data.
In the context of model evaluation, out-of-sample prediction typically involves splitting the available data into two subsets: the training set, which is used to fit the model, and the test set (sometimes loosely called a validation set), which is reserved for measuring the model's performance. The model is trained on the training set only, and its predictions are then compared to the actual outcomes in the test set. Because the test observations played no part in fitting, the resulting error gives researchers and practitioners an estimate of how the model will perform in real-world applications.
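The split-then-evaluate procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the synthetic data, the 80/20 split, the straight-line model, and the helper names (`holdout_split`, `fit_line`, `mse`) are all assumptions chosen for the example, not part of any particular library.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least squares fit of y = a*x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

# Hypothetical synthetic data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(50)]

train, test = holdout_split(data, test_fraction=0.2, seed=0)
model = fit_line(train)  # the model only ever sees the training set

print("in-sample MSE:    ", mse(model, train))
print("out-of-sample MSE:", mse(model, test))
```

The out-of-sample figure is the one that estimates performance on new data; the in-sample figure is shown only for comparison.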
There are several strategies for implementing out-of-sample prediction, including:
- Holdout method: Split the dataset into a training set and a separate test set.
- Cross-validation: The data is divided into multiple subsets (folds), and the model is trained and validated multiple times, so that each data point is used for both training and testing, though never in the same round.
- Time series split: For time-ordered data, this method respects the temporal order of observations when splitting, so that the model is always trained on earlier observations and tested on later ones.
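The last two strategies differ mainly in how the indices are partitioned. The sketch below generates the index pairs for each one using only the standard library. The function names `k_fold_indices` and `expanding_window_indices` are hypothetical, and the expanding-window scheme is one common way of respecting temporal order, assumed here for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def expanding_window_indices(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(test_start))  # everything before the test block
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

kfold = list(k_fold_indices(10, 3))
ts = list(expanding_window_indices(10, 3, 2))

for train_idx, test_idx in kfold:
    print("k-fold      train:", train_idx, "test:", test_idx)
for train_idx, test_idx in ts:
    print("time series train:", train_idx, "test:", test_idx)
```

For data without a time ordering, the rows would normally be shuffled before the k-fold indices are applied. For time series they must stay in chronological order, which is why the second generator never places a training index after a test index.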
Out-of-sample prediction is essential for detecting and avoiding overfitting, where a model performs well on training data but poorly on new data. By validating a model on out-of-sample data, practitioners can check that it is robust and reliable before deploying it in real-world scenarios.
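To make the overfitting gap concrete, the sketch below uses a deliberately extreme model: a one-nearest-neighbor predictor that simply memorizes the training set. The model choice, the synthetic data, and the helper names are assumptions for illustration only. Its in-sample error is zero, while its out-of-sample error is not, which is the pattern that out-of-sample evaluation is meant to expose.

```python
import random

rng = random.Random(0)

def make_point():
    """Hypothetical noisy observation of y = 3x."""
    x = rng.uniform(0, 10)
    return x, 3 * x + rng.gauss(0, 2)

train = [make_point() for _ in range(30)]
test = [make_point() for _ in range(30)]

def nearest_neighbor_predict(x):
    """Return the y of the closest training x (pure memorization)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict, points):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

train_mse = mse(nearest_neighbor_predict, train)  # each point matches itself
test_mse = mse(nearest_neighbor_predict, test)    # unseen points expose the noise

print("in-sample MSE:    ", train_mse)
print("out-of-sample MSE:", test_mse)
```

Judged on training data alone, this model would look perfect; only the held-out error reveals that it has fitted the noise rather than the underlying trend.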