Evaluating LLMs the Right Way: Lessons from Hex's Journey

High Agency: The Podcast for AI Builders

#Tech #Raza Habib #Large Language Models #Generativeai #AI Products #Ai Playbooks

about a year ago 45:39

I recently sat down with Bryan Bischof, AI lead at Hex, to dive deep into how they evaluate LLMs to ship reliable AI agents. Hex has deployed AI assistants that can automatically generate SQL queries, transform data, and create visualizations based on natural language questions. While many teams struggle to get value from LLMs in production, Hex has cracked the code.

In this episode, Bryan shares the hard-won lessons they've learned along the way. We discuss why most teams are approaching LLM evaluation wrong and how Hex's unique framework enabled them to ship with confidence.

Bryan breaks down the key ingredients to Hex's success:
- Choosing the right tools to constrain agent behavior
- Using a reactive DAG to allow humans to course-correct agent plans
- Building granular, user-centric evaluators instead of chasing one "god metric"
- Gating releases on the metrics that matter, not just gaming a score
- Constantly scrutinizing model inputs & outputs to uncover insights
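To make the evaluator and release-gating ideas in the list above concrete, here is a minimal sketch in Python. It is not Hex's implementation: the `Sample` record, the two evaluator functions, and the thresholds in `GATES` are all hypothetical names and values chosen for illustration. The point is only the shape: several narrow pass/fail checks, each reported separately, with a release gate that requires every check to clear its own bar.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


# Hypothetical record of one assistant response; field names are illustrative.
@dataclass
class Sample:
    question: str
    generated_sql: str
    expected_tables: Tuple[str, ...]


def is_read_only(sample: Sample) -> bool:
    """Crude heuristic: the statement must start with SELECT or WITH."""
    tokens = sample.generated_sql.strip().split(None, 1)
    return bool(tokens) and tokens[0].upper() in {"SELECT", "WITH"}


def mentions_expected_tables(sample: Sample) -> bool:
    """Every table the user's question needs appears somewhere in the SQL."""
    sql = sample.generated_sql.lower()
    return all(table.lower() in sql for table in sample.expected_tables)


# One narrow evaluator per user-facing property, instead of one blended score.
EVALUATORS: Dict[str, Callable[[Sample], bool]] = {
    "read_only": is_read_only,
    "uses_expected_tables": mentions_expected_tables,
}

# Assumed thresholds; a real team would pick these per metric.
GATES: Dict[str, float] = {"read_only": 1.0, "uses_expected_tables": 0.9}


def pass_rates(samples: List[Sample]) -> Dict[str, float]:
    """Fraction of samples that pass each evaluator, reported separately."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        name: sum(check(s) for s in samples) / len(samples)
        for name, check in EVALUATORS.items()
    }


def release_allowed(rates: Dict[str, float]) -> bool:
    """Ship only if every gated metric meets its own threshold."""
    return all(rates[name] >= threshold for name, threshold in GATES.items())
```

Because each metric is gated on its own, a regression in one property (for example, a non-read-only query slipping through) blocks the release even when the other pass rates are high.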

For show notes and a transcript go to:
https://hubs.ly/Q02BdzVP0
-----------------------------------------------------
Humanloop is an Integrated Development Environment for Large Language Models. It enables product teams to develop LLM-based applications that are reliable and scalable. To find out more go to https://hubs.ly/Q02yV72D0

34 episodes

All episodes

How Graphite's $50M Series B is Transforming AI Code Review

20 days ago 43:15

Merrill Lutsky, co-founder and CEO of Graphite, discusses their evolution from stack diff workflows to Diamond, an AI code review agent that just helped secure their $50M Series B. He shares insights on building reliable AI review systems, why over-generating and pruning comments works better than single responses, and the shift from RAG to agentic code browsing. Merrill offers a provocative vision where developers define requirements and AI agents build the code, potentially eliminating traditional IDE coding. This episode provides valuable perspectives on how AI is fundamentally reshaping software development workflows and engineering roles.

Chapters:
00:00 - Introduction and Graphite overview
01:58 - Evolution from stack diffs to AI review
07:39 - Diamond: The AI code reviewer explained
10:13 - Human vs AI review: Finding the balance
11:44 - Engineering challenges of reliable AI review
17:38 - Over-generate and prune: A winning strategy
24:49 - From RAG to code browser agents
28:12 - The bitter lesson of AI engineering
30:48 - The future of software engineering
37:33 - Is AI over or under-hyped?…

The End of Language-Only Models l Amit Jain, Luma AI

27 days ago 40:17

This week Raza is joined by Amit Jain, CEO and co-founder of Luma AI, to explore why the future of artificial intelligence lies beyond language. Amit shares Luma’s bold mission to build world models through multimodal training and why video is the most overlooked and critical data source in AI today.

Chapters:
00:00 - Introduction
03:40 - Competing with Big AI Labs: Language vs. Multimodality
08:09 - Joint Training and Why Current Multimodal Models Fall Short
11:01 - Language is Discrete, the World is Continuous
14:36 - Do These Models Have World Models?
18:18 - Planning, Counterfactuals, and Causal Reasoning in AI
22:08 - Capabilities of Ray 2 and Real-World Use Cases
26:14 - Rethinking Video Length and Creative Workflows
29:18 - Solving Coherence Across Shots and Characters
30:00 - When Will AI Create a Feature-Length Film?
31:27 - What You Can Build with Luma’s API Today
35:49 - Overlooked Ideas and Noise in the AI Industry
38:34 - Why Video is the Missing Link in AI…

From 0 to $40M in 5 Months: Bolt.new Story with Eric Simons

10 weeks ago 41:33

Eric Simons discusses the meteoric rise of Bolt.new, an AI-powered web app builder that went from zero to $40 million ARR in just five months. He shares insights on how they built an AI agent capable of creating full-stack web applications from simple prompts, the challenges of rapid growth, and the future of AI in software development. From nearly shutting down the company to becoming one of the fastest-growing AI products in history, Eric offers valuable lessons for anyone building in the AI space.

Chapters:
00:00 - Introduction and Bolt.new overview
06:05 - The journey from near-shutdown to rapid growth
13:28 - Challenges of explosive growth and scaling
18:50 - Technical deep dive: Building Bolt.new
26:37 - Debugging and improving AI-generated code
32:09 - Future directions and enterprise adoption
34:11 - Advice for building AI applications
37:03 - The concept of "vibe revenue" in AI startups
39:39 - Is AI over or under-hyped?
-----------------------------------------------------
Humanloop is the LLM evals platform for enterprises. We give you the tools that top teams use to ship and scale AI with confidence. To find out more go to humanloop.com…

Saving Pharma Companies Billions with AI l Patrick Leung from Faro Health

11 weeks ago 48:04

In this episode of High Agency, Patrick Leung from Faro Health explains how they're using AI to revolutionize clinical trial design by both generating regulatory documents and extracting insights from thousands of existing trials. Patrick emphasises the essential collaboration between clinical experts and AI engineers when building reliable systems in healthcare's high-stakes environment.

Chapters:
00:00 - Introduction
04:26 - Clinical trials before: Microsoft Word Documents
08:17 - Document generation using AI
12:26 - What makes clinical trials so expensive
16:26 - Parsing and processing clinical trial data
18:04 - Challenges with traditional evaluation metrics
21:28 - Importance of domain experts in the evaluation process
24:35 - Collaboration between domain experts and engineering
31:26 - Building a graph-based knowledge system
34:27 - Roles and skillsets required
38:06 - Lessons learned building LLM products
40:56 - Discussion on AI capabilities and limitations
46:07 - Is AI overhyped or underhyped

100x Hiring Speed with Superhuman Recruiters l Metaview Co-Founder

13 weeks ago 53:07

In this episode, Raza is joined by Shahriar Tajbakhsh, the co-founder of Metaview. They discuss how Metaview’s AI scribe automates interview note-taking, how AI agents can surface top candidates from thousands of resumes, and why hiring managers should think of AI as a co-worker, not just a tool. Raza's recommended reading: Creating a LLM-as-a-Judge That Drives Business Results.

Chapters:
00:00 - Introduction
03:32 - How AI Co-Workers Are Transforming Recruiting
06:21 - Inside MetaView: AI Scribe and Workflow Automation
09:11 - Unlocking Hiring Insights with AI-Driven Conversations
11:30 - Balancing AI Innovation and User Adoption
14:05 - Metaview’s Tech Stack and the Role of LLMs
18:29 - How MetaView Generates Superhuman Interview Notes
23:18 - The Challenges of Building Reliable AI Hiring Agents
32:40 - The Future of AI in Hiring: Automating Job Descriptions
40:26 - AI Co-Workers That Work While You Sleep
47:08 - Why Vertical AI Will Win Over General AI Agents
50:24 - The Underrated Power of Graph-Based AI

AI Will Replace Command Lines I Ex-Google Tech Lead and Founder at Warp

15 weeks ago 47:45

In this episode, Raza Habib chats with Zach Lloyd, CEO and founder of Warp, about how AI is transforming the developer experience. They explore how Warp is reimagining the command line, the power of AI-driven automation, and what the future holds for coding workflows.

Chapters:
00:00 - Introduction
04:06 - Why the terminal needed reinvention
07:11 - AI’s role in Warp’s evolution
08:55 - Key AI features in Warp
12:49 - Balancing safety, reliability, and usability
19:43 - Challenges in AI-Powered development
22:33 - Changing developer behavior with AI
27:24 - Prompt engineering and context optimization
31:05 - Lessons for building AI products
37:50 - The future of AI in software development
46:42 - Underappreciated AI innovations

Google Is Dead: How This 144-GPU Startup Is Building Einstein-Level AI Search I Will Bryk | Exa CEO

17 weeks ago 38:44

Will Bryk, CEO of Exa, sits down with Raza Habib to reveal why traditional search engines are becoming obsolete and how his startup is building an AI-powered search engine for the future. From constructing a massive GPU cluster to predicting AI will surpass human mathematicians by 2026, Will shares fascinating insights about the technological breakthroughs that will reshape society in the coming months.

Chapters:
00:00 - Introduction
05:13 - Exa as a Tool for LLMs and Neural Search
06:19 - Introducing "Websets" and Its Use Cases
10:16 - Building a Compute Cluster: Why Own vs. Rent?
12:00 - The Bitter Lesson and Scalability in AI
17:11 - Interesting Use Cases for Exa
19:44 - People Search and CRM Opportunities
21:10 - Predictions for AI Progress and Test-Time Compute
27:10 - Implications of AI on Creative Tasks and Society
29:15 - Automation, Jobs, and the Knowledge Economy
33:57 - What Could Stop AI Progress?
36:22 - Advice for AI Builders and Entrepreneurs

$100M raised: How Decagon is building better AI agents I Jesse Zhang

20 weeks ago 41:45

In this episode, Jesse Zhang joins Raza to discuss building cutting-edge AI agents for customer support. They explore how his early passion for LLMs led to creating a company that’s transforming the way businesses like Rippling, Duolingo, and Webflow interact with customers. Jesse breaks down the challenges of scaling AI systems, the importance of customer feedback, and his predictions for the future of AI.

Chapters:
00:00 - Introduction and Jesse Zhang's Background
01:17 - First Exposure to LLMs and Building Early Projects
04:32 - Decagon’s Rapid Growth and Differentiation in AI
06:37 - Understanding Decagon’s AI Customer Support Product
10:21 - Challenges in Building High-Performance AI Systems
13:14 - Evolution from Simple RAG to Agent Architectures
16:54 - Measuring Accuracy with Evals and Customer Feedback
19:05 - Balancing Customization and Reusability Across Clients
22:35 - Handling Customer Data and Incremental Deployment
25:21 - Restructuring Support Teams for AI Integration
27:03 - Team Composition and the Role of Domain Expertise
29:19 - Advice for New AI Builders: Customer-Driven Development
32:21 - Key Insights on AI Agents and Enterprise Adoption
36:34 - Predictions for AI Advancements in 2025
39:41 - Is AI Overhyped or Underhyped?
41:07 - Closing Remarks and Final Thoughts

How GitHub Copilot Became the First LLM-Powered Developer Tool with Ryan Salva

22 weeks ago 38:53

On this week's episode, former GitHub Copilot lead Ryan Salva breaks down how AI coding tools became ubiquitous almost overnight. They discuss the critical differences between what novice and expert developers expect from AI, why starting with predictive text was both a blessing and a curse, and how the rapid adoption of AI assistance is reshaping the future of software development.

Chapters:
00:00 - Introduction
01:09 - The Creation of GitHub Copilot
05:39 - From Prototype to Product: Challenges in Scaling
07:37 - How GitHub Copilot Works Behind the Scenes
11:18 - Metrics That Matter: Evaluating AI Success
14:43 - Building Momentum: What It Feels Like to Launch a Hit
17:51 - The Evolution of AI Tools for Developers
21:13 - Evaluations and Testing in AI Development
26:00 - The Role of Automation and the Future of Coding
30:53 - Will Engineers Still Write Code in the Future?
33:16 - Advice for Aspiring AI Builders
36:51 - Is AI Overhyped or Underhyped?
38:17 - Closing Reflections

What Gives an AI Founder Staying Power I James Theuerkauf, CEO of Syrup Tech I Sara Ittelson, Partner at Accel

23 weeks ago 43:36

In this week's episode, Raza speaks with James Theuerkauf, CEO of Syrup Tech, and Sara Ittelson, Partner at Accel, to explore the challenges and opportunities for entrepreneurs in this transformative era. They discuss building AI-first companies and the lessons learned from scaling in a rapidly evolving space. With practical tips on leveraging data, creating competitive advantages, and sustaining passion for the long haul, this episode offers invaluable guidance for founders in AI.

Chapters:
00:00 - Introduction and Guest Backgrounds
01:27 - Syrup Tech’s Approach to AI in Retail
03:29 - The Role of AI in Demand Forecasting
08:49 - Building Effective AI Systems and Teams
15:30 - How Generative AI is Shaping Businesses
19:18 - Advice for Founders in the AI Era
28:15 - Building an AI-First Company
33:26 - Innovations and Trends in AI
38:47 - Is AI Overhyped or Underhyped?
42:46 - Closing Thoughts and Reflections

How to build great AI products with Vanta Software Developer Noam Rubin

25 weeks ago 40:57

In this episode, Noam Rubin, a Software Developer at Vanta, reveals how his team uses data-driven strategies to design, test, and improve cutting-edge AI features. Learn how customer insights, rapid prototyping, and iterative development transform raw ideas into tools that make compliance and security easier for businesses everywhere.

Chapters:
00:00 - Introduction
02:47 - The process of building AI products at Vanta
04:51 - The role of customer feedback in product development
06:59 - Integrating AI into security and compliance workflows
08:06 - Using data specifications to guide product development
10:10 - Collaborating with subject matter experts to refine AI models
12:14 - Iterative testing and refining AI features
14:10 - Quality control and ensuring AI accuracy
16:00 - The importance of dogfooding and internal feedback loops
18:23 - Scaling AI features and rolling them out to wider audiences
20:50 - Educating engineers and democratizing AI at Vanta
22:20 - Key lessons learned from building AI products
24:12 - Maintaining AI quality through continuous feedback
26:00 - The future of AI in business and product development…

Predictions for AI in 2025 I Ex-OpenAI, Ex-Stripe researcher Stanislav Polu

26 weeks ago 44:27

In this episode of High Agency, former OpenAI researcher Stan Polu shares his journey from AI research to founding Dust, an enterprise AI platform. Stan offers a contrarian view on the future of AI, suggesting we may be hitting a plateau in model capabilities since GPT-4. He discusses why startups should focus on product-market fit before investing in GPUs, shares practical lessons for building AI products, and predicts increased competition between AI labs and API developers.

Chapters:
00:00 - Introducing Dust: an enterprise AI platform
06:07 - From Stripe to OpenAI: Stan's journey
10:29 - Why research wasn't enough: building Dust
15:10 - Best practices for building an AI product
20:50 - Is prompt engineering here to stay
23:40 - Understanding language models and their limitations
32:56 - Predictions for AI in 2025
39:53 - Measuring progress toward AGI
42:26 - The true value of AI technology

How Replicate is Democratizing AI with Open-Source Resources

30 weeks ago 36:15

In this episode, we explore how Replicate is breaking down barriers in AI development through its open-source platform. CEO Ben Firshman shares how Replicate enables developers without machine learning expertise to run AI models in the cloud.

00:00 Introduction
00:29 Overview of Replicate
03:13 Replicate's user base
05:45 Enterprise use cases and lowering the AI barrier
07:45 The complexity of traditional AI deployment
10:24 Simplifying AI with Replicate's API
13:50 ControlNets and the challenges of image models
19:42 Fragmentation in AI models: images vs. language
25:05 Customization and multi-model pipelines in production
26:33 Learning by doing: skills for AI engineers
28:44 Applying AI in governments
31:12 Iterative development and co-evolution of AI specs
33:13 Final reflections on AI hype
35:18 Conclusion

The Principles for Building Excellent AI Features with Superhuman’s Lorilyn McCue

31 weeks ago 42:35

How do you build AI tools that actually meet users’ needs? In this episode of High Agency, Raza speaks with Lorilyn McCue, the driving force behind Superhuman’s AI-powered features. Lorilyn lays out the principles that guide her team’s work, from continuous learning to prioritizing user feedback. Learn how Superhuman’s "learning-first" approach allows them to fine-tune features like Ask AI and AI-driven summaries, creating practical solutions for today’s professionals.

00:00 - Introduction
04:20 - Overview of Superhuman
06:50 - Instant Reply and Ask AI
10:00 - Building On-Demand vs. Always-On AI Features
13:45 - Prompt Engineering for Effective Summarization
22:35 - The Importance of Seamless AI Integration in User Workflows
25:10 - Developing Advanced Email Search with Contextual Reasoning
29:45 - Leveraging User Feedback
32:15 - Balancing Customization and Scalability in AI-Generated Emails
36:05 - Approach to Prioritization
39:30 - Real-World Use Cases: The Versatility of Current AI Capabilities
43:15 - Learning and Staying Updated in the Rapidly Evolving AI Field
46:00 - Is AI Overhyped or Underhyped?
49:20 - Final Thoughts and Closing Remarks

Jeff Huber of Chroma: Building the open-source toolkit for AI Engineering

33 weeks ago 54:59