[QA] RM-R1: Reward Modeling as Reasoning
This paper introduces Reasoning Reward Models (ReasRMs) to improve interpretability and performance in reward modeling for large language models, and reports state-of-the-art results from its training methods.
https://arxiv.org/abs//2505.02387
YouTube: https://www.youtube.com/@ArxivPapers
TikTok: https://www.tiktok.com/@arxiv_papers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
2303 episodes
All episodes
[QA] HYPERSTEER: Activation Steering at Scale with Hypernetworks 7:49
HYPERSTEER: Activation Steering at Scale with Hypernetworks 9:15
[QA] Accelerating Diffusion LLMs via Adaptive Parallel Decoding 8:08
Accelerating Diffusion LLMs via Adaptive Parallel Decoding 21:09
[QA] Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning 7:34
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning 16:44
[QA] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning 8:08
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning 23:02
[QA] ALPHAONE: Reasoning Models Thinking Slow and Fast at Test Time 7:21
ALPHAONE: Reasoning Models Thinking Slow and Fast at Test Time 17:12
[QA] ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models 7:40
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models 23:32
[QA] Are Reasoning Models More Prone to Hallucination? 7:52
Are Reasoning Models More Prone to Hallucination? 20:24
[QA] How does Transformer Learn Implicit Reasoning? 8:56
How does Transformer Learn Implicit Reasoning? 23:21
[QA] Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones 7:26
Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones 24:00
[QA] Maximizing Confidence Alone Improves Reasoning 7:08
Maximizing Confidence Alone Improves Reasoning 13:21
[QA] Hardware-Efficient Attention for Fast Decoding 7:57
Hardware-Efficient Attention for Fast Decoding 30:59
[QA] Reinforcing General Reasoning without Verifiers 7:08
Reinforcing General Reasoning without Verifiers 17:11
[QA] ENIGMATA: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles 8:16
ENIGMATA: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles 23:54
[QA] Temporal Sampling for Forgotten Reasoning in LLMs 7:04
Temporal Sampling for Forgotten Reasoning in LLMs 10:43
[QA] Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems 10:15
Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems 17:21
[QA] General-Reasoner: Advancing LLM Reasoning Across All Domains 7:40
General-Reasoner: Advancing LLM Reasoning Across All Domains 17:40
[QA] MMaDA: Multimodal Large Diffusion Language Models 8:06
MMaDA: Multimodal Large Diffusion Language Models 16:35
[QA] Harnessing the Universal Geometry of Embeddings 7:37
Harnessing the Universal Geometry of Embeddings 15:55
[QA] Panda: A pretrained forecast model for universal representation of chaotic dynamics 7:55
Panda: A pretrained forecast model for universal representation of chaotic dynamics 15:30
[QA] Pre-training Large Memory Language Models with Internal and External Knowledge 7:31
Pre-training Large Memory Language Models with Internal and External Knowledge 20:15
[QA] Understanding Prompt Tuning and In-Context Learning via Meta-Learning 7:28
Understanding Prompt Tuning and In-Context Learning via Meta-Learning 21:39
[QA] On the creation of narrow AI: hierarchy and nonlocality of neural network skills 7:21
On the creation of narrow AI: hierarchy and nonlocality of neural network skills 18:01
[QA] Do Language Models Use Their Depth Efficiently? 7:25
Do Language Models Use Their Depth Efficiently? 20:25
[QA] Enhancing Latent Computation in Transformers with Latent Tokens 8:42
Enhancing Latent Computation in Transformers with Latent Tokens 21:54
[QA] Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation 8:11
Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation 20:20
[QA] Visual Planning: Let's Think Only with Images 7:43
Visual Planning: Let's Think Only with Images 18:55
[QA] System Prompt Optimization with Meta-Learning 7:41
System Prompt Optimization with Meta-Learning 21:51
[QA] Revealing economic facts: LLMs know more than they say 7:35
Revealing economic facts: LLMs know more than they say 20:39
[QA] Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures 8:26
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures 43:30
[QA] Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models 7:24
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models 14:33
[QA] The COT ENCYCLOPEDIA: Analyzing, Predicting, and Controlling how a Reasoning Model will Think 7:42
The COT ENCYCLOPEDIA: Analyzing, Predicting, and Controlling how a Reasoning Model will Think 16:35
[QA] Adversarial Suffix Filtering: a Defense Pipeline for LLMs 7:28