Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening
This paper critiques GRPO's bias in training language models for theorem proving and introduces the unlikeliness reward to enhance performance and sample diversity, achieving competitive results.
https://arxiv.org/abs/2506.02355
YouTube: https://www.youtube.com/@ArxivPapers
TikTok: https://www.tiktok.com/@arxiv_papers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
2437 episodes
All episodes
[QA] AlphaGo Moment for Model Architecture Discovery 7:45
AlphaGo Moment for Model Architecture Discovery 23:47
[QA] Learning without training: The implicit dynamics of in-context learning 8:31
Learning without training: The implicit dynamics of in-context learning 13:23
[QA] NABLA: Neighborhood Adaptive Block-Level Attention 7:11
NABLA: Neighborhood Adaptive Block-Level Attention 12:47
[QA] Checklists Are Better Than Reward Models For Aligning Language Models 5:20
Checklists Are Better Than Reward Models For Aligning Language Models 13:43
[QA] Beyond Binary Rewards: Training LMs to Reason about Their Uncertainty 7:51
Beyond Binary Rewards: Training LMs to Reason about Their Uncertainty 15:07
[QA] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains 7:12
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains 12:03
[QA] Does More Inference-Time Compute Really Help Robustness? 7:44
Does More Inference-Time Compute Really Help Robustness? 20:29
[QA] Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning 7:51
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning 25:08
[QA] The Invisible Leash: Why RLVR May Not Escape Its Origin 8:26
The Invisible Leash: Why RLVR May Not Escape Its Origin 21:49
[QA] Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination 8:49
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination 22:17
[QA] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation 7:58
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation 27:15
[QA] AGENTSNET: Coordination and Collaborative Reasoning in Multi-Agent LLMs 7:37
AGENTSNET: Coordination and Collaborative Reasoning in Multi-Agent LLMs 19:47
[QA] Should We Still Pretrain Encoders with Masked Language Modeling? 8:09
Should We Still Pretrain Encoders with Masked Language Modeling? 16:52
[QA] Token Bottleneck: One Token to Remember Dynamics 7:30
Token Bottleneck: One Token to Remember Dynamics 16:06
[QA] A Systematic Analysis of Hybrid Linear Attention 7:55
A Systematic Analysis of Hybrid Linear Attention 15:40
[QA] Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs 8:31
Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs 15:32
[QA] Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving 8:09
Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving 21:33
[QA] Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful 7:03
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful 18:57
[QA] The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation 7:35
The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation 23:36
[QA] Cascade: Token-Sharded Private LLM Inference 7:04
Cascade: Token-Sharded Private LLM Inference 35:03
[QA] Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data 7:28
Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data 10:15
[QA] Strategic Intelligence in Large Language Models Evidence from evolutionary Game Theory. 7:21
Strategic Intelligence in Large Language Models Evidence from evolutionary Game Theory. 34:06
[QA] Fast and Simplex: 2-Simplicial Attention in Triton 7:28
Fast and Simplex: 2-Simplicial Attention in Triton 17:55
[QA] Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning 7:21
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning 15:33
[QA] DABstep: Data Agent Benchmark for Multi-step Reasoning 7:54
DABstep: Data Agent Benchmark for Multi-step Reasoning 16:50
[QA] Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? 8:16
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? 16:52
[QA] LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs 8:19
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs 14:25
[QA] Performance Prediction for Large Systems via Text-to-Text Regression 8:40
Performance Prediction for Large Systems via Text-to-Text Regression 20:32
[QA] From Memories to Maps: Mechanisms of In-Context Reinforcement Learning in Transformers 7:47
From Memories to Maps: Mechanisms of In-Context Reinforcement Learning in Transformers 20:44
[QA] OmniGen2: Exploration to Advanced Multimodal Generation 7:44
OmniGen2: Exploration to Advanced Multimodal Generation 32:16
[QA] OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling 7:28
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling 25:52
[QA] Potemkin Understanding in Large Language Models 8:04
Potemkin Understanding in Large Language Models 17:20
[QA] Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test 7:49