AI Glossary: What Is FSDP? Definition & Meaning

Parallélisme de données entièrement fragmenté (FSDP)

Fully Sharded Data Parallel (FSDP) is an advanced technique used in the training of large-scale apprentissage automatique models, particularly in apprentissage profond. As models become increasingly complex and data-intensive, traditional entraînement parallèle methods can become inefficient and may not scale well. FSDP addresses these challenges by distributing both the model parameters and the data across multiple devices, such as GPUs or TPUs, in a highly efficient manner.

In FSDP, each device holds only a shard or segment of the model’s parameters, which significantly reduces the memory footprint on each individual device. This sharding process allows for the training of larger models than would be possible on a single device. Additionally, FSDP employs a strategy known as accumulation de gradients, where gradients are calculated on each device and then combined, further optimizing the use of available resources.

The technique is particularly beneficial for training large neural networks, such as transformers, where the size of the model and the amount of training data can be prohibitively large for standard parallélisme de données methods. By implementing FSDP, researchers and practitioners can achieve better scalability, reduced memory usage, and faster training times.

Moreover, FSDP is often integrated with other parallelism strategies, like model parallelism and parallélisme de pipeline, to maximize efficiency. When combined, these strategies can lead to significant performance improvements in the training of state-of-the-art AI models.