Exploring OpenAI's o1-preview and o1-mini

Deep Papers

#Science #Tech #Math #Business #Arize AI


about a year ago 42:02


OpenAI recently released o1-preview, which it claims outperforms GPT-4o on a number of benchmarks. The o1 models are designed to think longer before answering and to handle complex tasks, especially science and math questions, better than OpenAI's other models.
We take a closer look at the latest crop of o1 models, and we also highlight some research our team did to see how they stack up against Claude 3.5 Sonnet on a real-world use case.
Read it on our blog: https://arize.com/blog/exploring-openai-o1-preview-and-o1-mini

Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.

49 episodes

24 subscribers

All episodes

Scalable Chain of Thoughts via Elastic Reasoning

5 days ago 28:54

In this week's episode, we talk about Elastic Reasoning, a novel framework designed to enhance the efficiency and scalability of large reasoning models (LRMs) by explicitly separating the reasoning process into two distinct phases: thinking and solution. This separation allows for independent allocation of computational budgets, addressing challenges related to uncontrolled output lengths in real-world deployments with strict resource constraints. Our discussion explores how Elastic Reasoning contributes to more concise and efficient reasoning, even in unconstrained settings, and its implications for deploying LRMs in resource-limited environments.
Read the paper here: https://arxiv.org/pdf/2505.05315
Sign up for the next discussion and see more AI research: arize.com/ai-research-papers

Sleep-time Compute: Beyond Inference Scaling at Test-time

19 days ago 30:24

What if your LLM could think ahead, preparing answers before questions are even asked? In this week's paper read, we dive into a new paper from researchers at Letta introducing sleep-time compute: a technique that lets models do their heavy lifting offline, well before the user query arrives. By predicting likely questions and precomputing key reasoning steps, sleep-time compute reduces test-time latency and cost without sacrificing performance. We explore new benchmarks (Stateful GSM-Symbolic, Stateful AIME, and the multi-query extension of GSM) that show up to 5x lower compute at inference, 2.5x lower cost per query, and up to 18% higher accuracy when scaled. You’ll also see how this method applies to realistic agent use cases and what makes it most effective. This episode is for anyone who cares about LLM efficiency, scalability, or cutting-edge research.
Explore more AI research, or sign up to hear the next session live: arize.com/ai-research-papers

LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection

5 weeks ago 27:19

For this week's paper read, we dive into our own research. We wanted to create a replicable, evolving dataset that can keep pace with model training, so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts at 1/10 the cost. Over the past few weeks, the Arize team generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models. We talk about what we built, the process we followed, and the bottom-line results.
📃 Read the paper: https://arize.com/llm-hallucination-dataset/

AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam

7 weeks ago 26:11

This week we talk about modern AI benchmarks, taking a close look at Google's recent Gemini 2.5 release and its performance on key evaluations, notably Humanity's Last Exam (HLE). We cover Gemini 2.5's architecture, its advancements in reasoning and multimodality, and its impressive context window. We also discuss how benchmarks like HLE and ARC AGI 2 help us understand the current state and future direction of AI.
Read it on the blog: https://arize.com/blog/ai-benchmark-deep-dive-gemini-humanitys-last-exam/
Sign up to watch the next live recording: https://arize.com/resource/community-papers-reading/

Model Context Protocol (MCP)

8 weeks ago 15:03

We cover Anthropic’s Model Context Protocol (MCP). Though it was released in November 2024, we've been seeing a lot of hype around it lately and thought it was well worth digging into. Learn how this open standard enables integration between LLMs and external data sources, turning them into capable, context-aware agents. We explore the key benefits of MCP, including enhanced context retention across interactions, improved interoperability for agentic workflows, and the development of more capable AI agents that can execute complex tasks in real-world environments.

AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs

12 weeks ago 30:23

This week, we're mixing things up a little bit. Instead of diving deep into a single research paper, we cover the biggest AI developments from the past few weeks. We break down key announcements, including:
DeepSeek’s Big Launch Week: A look at FlashMLA (DeepSeek’s new approach to efficient inference) and DeepEP (their communication library for expert-parallel Mixture-of-Experts training and inference).
Claude 3.7 & Claude Code: What’s new with Anthropic’s latest model, and what Claude Code brings to the AI coding assistant space.
Stay ahead of the curve with this fast-paced recap of the most important AI updates. We'll be back next time with our regularly scheduled programming.

How DeepSeek is Pushing the Boundaries of AI Development

13 weeks ago 29:54

This week, we dive into DeepSeek. SallyAnn DeLucia, Product Manager at Arize, and Nick Luzio, a Solutions Engineer, break down key insights on a model that has been dominating headlines for its significant breakthrough in inference speed over other models. What’s next for AI (and open source)? From training strategies to real-world performance, here’s what you need to know.
Read a summary: https://arize.com/blog/how-deepseek-is-pushing-the-boundaries-of-ai-development/

Multiagent Finetuning: A Conversation with Researcher Yilun Du

15 weeks ago 30:03

We talk to Yilun Du, Google DeepMind Senior Research Scientist (and incoming Assistant Professor at Harvard), about his latest paper, "Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains." The paper introduces a multiagent finetuning framework that enhances the performance and diversity of language models by employing a society of agents with distinct roles, improving feedback mechanisms and overall output quality. The method enables autonomous self-improvement through iterative finetuning, achieving significant performance gains across various reasoning tasks. It is applicable to both open-source and proprietary LLMs, and can integrate with human-feedback-based methods like RLHF or DPO.
Read an overview on the blog or watch the full discussion.

Training Large Language Models to Reason in Continuous Latent Space

18 weeks ago 24:58

LLMs have typically been restricted to reasoning in the "language space," where chain-of-thought (CoT) is used to solve complex reasoning problems. But a new paper argues that language space may not always be the best for reasoning. In this paper read, we cover an exciting new technique from a team at Meta called Chain of Continuous Thought, also known as "Coconut." The paper, "Training Large Language Models to Reason in a Continuous Latent Space," explores the potential of allowing LLMs to reason in an unrestricted latent space instead of being constrained by natural language tokens.
Read a full breakdown of Coconut on our blog.

LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods

21 weeks ago 28:57

We discuss a major survey of work and research on LLM-as-Judge from the last few years. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods" systematically examines the LLMs-as-Judge framework across five dimensions: functionality, methodology, applications, meta-evaluation, and limitations. The survey gives us a bird's-eye view of the framework's advantages and limitations, and of methods for evaluating its effectiveness.
Read a breakdown on our blog: https://arize.com/blog/llm-as-judge-survey-paper/

Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies

23 weeks ago 28:47

LLMs have revolutionized natural language processing, showcasing remarkable versatility and capabilities. But individual LLMs often exhibit distinct strengths and weaknesses, influenced by differences in their training corpora. This diversity poses a challenge: how can we maximize the efficiency and utility of LLMs? A new paper, "Merge, Ensemble, and Cooperate: A Survey on Collaborative Strategies in the Era of Large Language Models," highlights collaborative strategies to address this challenge. In this week's episode, we summarize key insights from this paper and discuss practical implications of LLM collaboration strategies across three main approaches: merging, ensemble, and cooperation. We also review some new open source models we're excited about.

Agent-as-a-Judge: Evaluate Agents with Agents

26 weeks ago 24:54

This week, we break down the “Agent-as-a-Judge” framework, a new agent evaluation paradigm that’s kind of like getting robots to grade each other’s homework. Where typical evaluation methods focus solely on outcomes or demand extensive manual work, this approach uses agent systems to evaluate agent systems, offering intermediate feedback throughout the task-solving process. With the potential to unlock scalable self-improvement, Agent-as-a-Judge could redefine how we measure and enhance agent performance. Let's get into it!

Introduction to OpenAI's Realtime API

27 weeks ago 29:56

We break down OpenAI’s Realtime API. Learn how to integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities, potential use cases, and best practices for implementation.

Swarm: OpenAI's Experimental Approach to Multi-Agent Systems

29 weeks ago 46:46

As multi-agent systems grow in importance for fields ranging from customer support to autonomous decision-making, OpenAI has introduced Swarm, an experimental framework that simplifies the process of building and managing these systems. Swarm, a lightweight Python library, is designed for educational purposes, stripping away complex abstractions to reveal the foundational concepts of multi-agent architectures. In this podcast, we explore Swarm’s design, its practical applications, and how it stacks up against other frameworks. Whether you’re new to multi-agent systems or looking to deepen your understanding, Swarm offers a straightforward, hands-on way to get started.
Read a summary on the blog, watch on YouTube, or sign up for upcoming paper readings.

KV Cache Explained

30 weeks ago 4:19