Are you struggling to train large AI models like LLMs and Vision Transformers because you don’t have access to high-end, large-memory GPUs? You’re not alone! This talk addresses that very challenge. Our primary goal is to make large-model training accessible to everyone, regardless of their hardware limitations. How will we achieve this? We’ll dive deep into a powerful combination of model-side efficiency techniques (QLoRA) and system-side scaling (Fully Sharded Data Parallel, FSDP). What will you learn and gain from attending?
The Evolution of Scaling Techniques: We’ll trace the progression of model design toward efficiency-focused techniques such as quantization and QLoRA, and review the journey from earlier parallelism methods (data, model, and pipeline parallelism) to the more comprehensive approach that FSDP offers.
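As a taste of where that progression lands, here is a minimal QLoRA-style setup. It is a sketch only, assuming the Hugging Face transformers, peft, and bitsandbytes libraries and a CUDA GPU; the model id (facebook/opt-350m) is a small placeholder and the LoRA hyperparameters are illustrative, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",            # placeholder model id for illustration
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters; the frozen 4-bit base stays untouched
# and only the adapter weights are updated during fine-tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in this model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints how few parameters are actually trained
```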
A Deep Dive into FSDP: Understand the fundamentals of Fully Sharded Data Parallel. Discover how it slashes per-GPU memory usage by sharding model parameters, gradients, and optimizer states across all available devices while still processing your data in parallel.
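To make the sharding concrete, here is a minimal FSDP sketch. It assumes PyTorch with CUDA and a launch via torchrun (so the process group and per-rank GPU come from the environment); the toy model and dummy loss are placeholders for a real training loop.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launched with e.g.: torchrun --nproc_per_node=4 <this_script>.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-in for a large transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# Wrapping shards parameters, gradients, and optimizer state across all ranks;
# full parameters are gathered only around each wrapped module's forward/backward.
model = FSDP(model)

# Build the optimizer AFTER wrapping so it tracks the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

batch = torch.randn(8, 1024, device="cuda")  # each rank trains on its own data shard
loss = model(batch).pow(2).mean()            # dummy loss for illustration
loss.backward()
optimizer.step()
dist.destroy_process_group()
```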
Practical Configuration and Best Practices: Get hands-on with key configuration techniques and PyTorch best practices. We’ll cover essential strategies like mixed precision, CPU offloading, activation checkpointing, and parameter auto-wrapping policies.
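The sketch below shows how those knobs can fit together in a single FSDP wrap. It assumes the same torchrun setup as the previous sketch and uses a small torch.nn.TransformerEncoder as a stand-in; for a Hugging Face model you would pass that model’s decoder-layer class to the wrap policy and checkpointing filter instead. The specific dtypes and flags are illustrative choices, not prescriptions.

```python
import functools
import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    CPUOffload,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

# Stand-in model; replace with your actual transformer.
model = TransformerEncoder(
    TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True), num_layers=4
).cuda()

# Wrap each transformer block as its own FSDP unit to keep peak memory low.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerEncoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(               # bf16 params, gradients, buffers
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    cpu_offload=CPUOffload(offload_params=True),  # park sharded params in CPU RAM
    device_id=torch.cuda.current_device(),
)

# Recompute each block's activations during backward instead of storing them.
apply_activation_checkpointing(
    model,
    check_fn=lambda module: isinstance(module, TransformerEncoderLayer),
)
```

Block-level wrapping plus checkpointing trades extra recomputation and communication for a much smaller per-GPU memory footprint, which is exactly the trade-off the next point examines.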
Analyzing the Trade-offs: Every powerful technique has its nuances. We’ll analyze the inherent trade-offs of these methods, in particular how the added inter-GPU communication overhead slows each training step, and show how trading some speed for memory makes large-scale training feasible even on smaller GPUs.
Who should attend?
This session is a must-attend for researchers and developers who are keen to train large models like LLMs on limited hardware. If you’re curious about advanced scaling techniques in AI and deep learning, you’ll find immense value in this talk!