Evaluating LLMs the Right Way: Lessons from Hex's Journey
I recently sat down with Bryan Bischof, AI lead at Hex, to dive deep into how they evaluate LLMs to ship reliable AI agents. Hex has deployed AI assistants that can automatically generate SQL queries, transform data, and create visualizations based on natural language questions. While many teams struggle to get value from LLMs in production, Hex has cracked the code.
In this episode, Bryan shares the hard-won lessons they've learned along the way. We discuss why most teams are approaching LLM evaluation wrong and how Hex's unique framework enabled them to ship with confidence.
Bryan breaks down the key ingredients to Hex's success:
- Choosing the right tools to constrain agent behavior
- Using a reactive DAG to allow humans to course-correct agent plans
- Building granular, user-centric evaluators instead of chasing one "god metric"
- Gating releases on the metrics that matter, not just gaming a score
- Constantly scrutinizing model inputs & outputs to uncover insights
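As a rough illustration of the "granular evaluators" and "gating releases" ideas in the list above, here is a minimal sketch in Python. It is not Hex's actual framework: the evaluator names, the thresholds, and the sample queries are all hypothetical, chosen only to show how several small pass/fail checks with their own release thresholds differ from a single aggregate score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One model output to evaluate (hypothetical structure)."""
    question: str
    generated_sql: str


Evaluator = Callable[[Sample], bool]


def starts_with_select(sample: Sample) -> bool:
    # Narrow check: the output is a read query.
    return sample.generated_sql.strip().lower().startswith("select")


def no_destructive_statement(sample: Sample) -> bool:
    # Narrow check: no whitespace-separated token is a destructive keyword.
    tokens = sample.generated_sql.lower().split()
    return not any(word in tokens for word in ("drop", "delete", "truncate"))


def has_row_limit(sample: Sample) -> bool:
    # Narrow check: the query bounds its result size.
    return "limit" in sample.generated_sql.lower().split()


# Each evaluator answers one narrow question (all names are illustrative).
EVALUATORS: Dict[str, Evaluator] = {
    "starts_with_select": starts_with_select,
    "no_destructive_statement": no_destructive_statement,
    "has_row_limit": has_row_limit,
}

# Assumed per-evaluator minimum pass rates; a release must clear every one.
THRESHOLDS: Dict[str, float] = {
    "starts_with_select": 0.7,
    "no_destructive_statement": 1.0,
    "has_row_limit": 0.5,
}


def pass_rates(samples: List[Sample], evaluators: Dict[str, Evaluator]) -> Dict[str, float]:
    """Fraction of samples that pass each evaluator, reported separately."""
    return {
        name: sum(check(s) for s in samples) / len(samples)
        for name, check in evaluators.items()
    }


def failing_gates(rates: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of evaluators whose pass rate is below its threshold."""
    return sorted(name for name, minimum in thresholds.items() if rates[name] < minimum)


# Made-up sample outputs, for demonstration only.
SAMPLES = [
    Sample("Revenue by region?", "SELECT region, SUM(revenue) FROM sales GROUP BY region LIMIT 100"),
    Sample("List all users", "SELECT * FROM users"),
    Sample("Clean up users", "DROP TABLE users"),
    Sample("First ten user names", "SELECT name FROM users LIMIT 10"),
]

if __name__ == "__main__":
    rates = pass_rates(SAMPLES, EVALUATORS)
    blocked_by = failing_gates(rates, THRESHOLDS)
    print(rates)
    print("release blocked by:", blocked_by if blocked_by else "nothing")
```

Keeping the pass rates separate means a regression in one behavior (here, a destructive statement slipping through) blocks the release even when the other checks look healthy, which an averaged score could hide.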
For show notes and a transcript go to:
https://hubs.ly/Q02BdzVP0
-----------------------------------------------------
Humanloop is an Integrated Development Environment for Large Language Models. It enables product teams to develop LLM-based applications that are reliable and scalable. To find out more go to https://hubs.ly/Q02yV72D0
34 episodes
All episodes
- How Graphite's $50M Series B is Transforming AI Code Review (43:15)
- The End of Language-Only Models l Amit Jain, Luma AI (40:17)
- From 0 to $40M in 5 Months: Bolt.new Story with Eric Simons (41:33)
- Saving Pharma Companies Billions with AI l Patrick Leung from Faro Health (48:04)
- 100x Hiring Speed with Superhuman Recruiters l Metaview Co-Founder (53:07)
- AI Will Replace Command Lines I Ex-Google Tech Lead and Founder at Warp (47:45)
- Google Is Dead: How This 144-GPU Startup Is Building Einstein-Level AI Search I Will Bryk | Exa CEO (38:44)
- $100M raised: How Decagon is building better AI agents I Jesse Zhang (41:45)
- How GitHub Copilot Became the First LLM-Powered Developer Tool with Ryan Salva (38:53)
- What Gives an AI Founder Staying Power I James Theuerkauf, CEO of Syrup Tech I Sara Ittelson, Partner at Accel (43:36)
- How to build great AI products with Vanta Software Developer Noam Rubin (40:57)
- Predictions for AI in 2025 I Ex-OpenAI, Ex-Stripe researcher Stanislav Polu (44:27)
- How Replicate is Democratizing AI with Open-Source Resources (36:15)
- The Principles for Building Excellent AI Features with Superhuman’s Lorilyn McCue (42:35)
- Jeff Huber of Chroma: Building the open-source toolkit for AI Engineering (54:59)