
Content provided by Medical Attention. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Medical Attention or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.

Ep.10 Are benchmarks broken?

56:53
 

In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies to discuss how large language models are assessed, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.

0:25 - Technical wrap: what are agents?

13:20 - What are benchmarks?

  • 18:20 - Automated evaluation

  • 20:10 - Benchmarks

  • 37:45 - Human feedback

  • 44:50 - LLM as judge

Read more about the projects we discuss here:

More details in the show notes on our website.



12 episodes
