DataStates-LLM: Scalable Checkpointing for Transformer Models Using Composable State Providers

arXiv:2601.16956v1 (Announce Type: cross). Abstract: The rapid growth of large Transformer-based models, specifically Large Language Models (LLMs), now scaling to trillions of parameters, has necessitated training across thousands of GPUs using complex hybrid parallelism strategies (e.g., data, tensor, and pipeline parallelism). Checkpointing this massive, distributed state is critical for a wide range of use cases, such as resilience, suspend-resume, investigating undesirable training trajectories...

arXiv cs.AI · Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae