Inside s1: An o1-Style Reasoning Model That Cost Under $50 to Train with Niklas Muennighoff - #721
Today, we're joined by Niklas Muennighoff, a PhD student at Stanford University, to discuss his paper, “s1: Simple Test-Time Scaling.” We explore the motivations behind s1 and how it compares to OpenAI's o1 and DeepSeek's R1 models. We dig into the different approaches to test-time scaling, including parallel and sequential scaling, as well as s1's data curation process, its training recipe, and its use of model distillation from Google Gemini and DeepSeek R1. We also explore "budget forcing," a novel technique introduced in the paper that allows the model to think longer on harder problems and optimizes test-time compute for better performance. Additionally, we cover the evaluation benchmarks used, the comparison between supervised fine-tuning and reinforcement learning, and related projects like Hugging Face's Open R1. Finally, we discuss the open-sourcing of s1 and its future directions.
The complete show notes for this episode can be found at https://twimlai.com/go/721.
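For readers unfamiliar with the budget forcing technique discussed in the episode: per the paper, it is a decoding-time intervention that caps or extends the model's "thinking" phase; if the model tries to end its reasoning before a minimum token budget is reached, the end-of-thinking delimiter is suppressed and a cue such as "Wait" is appended, nudging it to keep reasoning. A minimal toy sketch of that loop (the model here is a stub, and the function and delimiter names are illustrative, not from the s1 codebase):

```python
def budget_force(next_token_fn, min_tokens, max_tokens,
                 eot="</think>", cue="Wait"):
    """Toy budget-forcing decode loop: suppress a too-early
    end-of-thinking token, and truncate overly long reasoning."""
    tokens = []
    while len(tokens) < max_tokens:
        tok = next_token_fn(tokens)
        if tok == eot:
            if len(tokens) >= min_tokens:
                break              # budget satisfied: allow the model to stop
            tokens.append(cue)     # stopped too early: force more thinking
            continue
        tokens.append(tok)
    return tokens

# Stub "model" that always tries to stop after emitting 3 tokens.
def stub_model(tokens):
    return "</think>" if len(tokens) >= 3 else "step"

out = budget_force(stub_model, min_tokens=6, max_tokens=10)
# The stub is forced to keep "thinking" until the minimum budget is met.
```

In a real implementation this logic would sit inside the sampling loop of an LLM inference stack (e.g. as a custom stopping criterion) rather than wrapping a toy token function.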
758 episodes
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
All episodes
Distilling Transformers and Diffusion Models for Robust Edge Use Cases with Fatih Porikli - #738 (1:00:29)
Building the Internet of Agents with Vijoy Pandey - #737 (56:13)
LLMs for Equities Feature Forecasting at Two Sigma with Ben Wellington - #736 (59:31)
Zero-Shot Auto-Labeling: The End of Annotation for Computer Vision with Jason Corso - #735 (56:45)
Grokking, Generalization Collapse, and the Dynamics of Training Deep Neural Networks with Charles Martin - #734 (1:25:21)
RAG Risks: Why Retrieval-Augmented LLMs are Not Safer with Sebastian Gehrmann - #732 (57:09)
From Prompts to Policies: How RL Builds Better AI Agents with Mahesh Sathiamoorthy - #731 (1:01:25)
How OpenAI Builds AI Agents That Think and Act with Josh Tobin - #730 (1:07:27)
CTIBench: Evaluating LLMs in Cyber Threat Intelligence with Nidhi Rastogi - #729 (56:18)
Generative Benchmarking with Kelly Hong - #728 (54:17)
Exploring the Biology of LLMs with Circuit Tracing with Emmanuel Ameisen - #727 (1:34:06)
Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726 (51:45)
Waymo's Foundation Model for Autonomous Driving with Drago Anguelov - #725 (1:09:07)
Dynamic Token Merging for Efficient Byte-level Language Models with Julie Kallini - #724 (50:32)