Go offline with the Player FM app!
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
Manage episode 493618164 series 3524393
This paper challenges conventional wisdom on small batch sizes in language model training, demonstrating their stability, robustness, and efficiency, while providing guidelines for hyperparameter adjustments and batch size selection.
https://arxiv.org/abs//2507.07101
YouTube: https://www.youtube.com/@ArxivPapers
TikTok: https://www.tiktok.com/@arxiv_papers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
2397 episodes
Manage episode 493618164 series 3524393
This paper challenges conventional wisdom on small batch sizes in language model training, demonstrating their stability, robustness, and efficiency, while providing guidelines for hyperparameter adjustments and batch size selection.
https://arxiv.org/abs//2507.07101
YouTube: https://www.youtube.com/@ArxivPapers
TikTok: https://www.tiktok.com/@arxiv_papers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
2397 episodes
All episodes
×Welcome to Player FM!
Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.