Can AIs do AI R&D? Reviewing REBench Results with Neev Parikh of METR
In this episode of The Cognitive Revolution, Nathan explores METR's groundbreaking REBench evaluation framework with Neev Parikh. We dive deep into how this new benchmark assesses AI systems' ability to perform real machine learning research tasks, from optimizing GPU kernels to fine-tuning language models. Join us for a fascinating discussion about the current capabilities of AI models like Claude 3.5 and GPT-4, and what their performance tells us about the trajectory of artificial intelligence development.
Check out METR's work:
blog post: https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
paper: https://metr.org/AI_R_D_Evaluation_Report.pdf
jobs: https://hiring.metr.org/
The Cognitive Revolution Ask Me Anything and Listener Survey: https://docs.google.com/forms/d/1aYv2XLID7RqGxj2_Y4_6x9mo_aqXcGCeLw1EQhy4IpY/edit
Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse
SPONSORS:
GiveWell: GiveWell has spent over 17 years researching global health and philanthropy to identify the highest-impact giving opportunities. Over 125,000 donors have contributed more than $2 billion, saving over 200,000 lives through evidence-backed recommendations. First-time donors can have their contributions matched up to $100 before year-end. Visit https://GiveWell.org, select podcast, and enter Cognitive Revolution at checkout to make a difference today.
SelectQuote: Finding the right life insurance shouldn't be another task you put off. SelectQuote compares top-rated policies to get you the best coverage at the right price. Even in our AI-driven world, protecting your family's future remains essential. Get your personalized quote at https://selectquote.com/cognitive
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive
Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today.
CHAPTERS:
(00:00:00) Teaser
(00:01:04) About the Episode
(00:05:14) Introducing METR
(00:07:36) Specialization of AI Risk
(00:09:52) AI R&D vs. Autonomy
(00:12:41) Benchmark Design Choices
(00:16:04) Benchmark Design Principles (Part 1)
(00:18:54) Sponsors: GiveWell | SelectQuote
(00:21:44) Benchmark Design Principles (Part 2)
(00:22:35) AI vs. Human Evaluation
(00:26:55) Optimizing Runtimes
(00:36:02) Sponsors: Oracle Cloud Infrastructure (OCI) | Weights & Biases RAG++
(00:38:20) AI Myopia
(00:43:37) Optimizing Loss
(00:47:59) Optimizing Win Rate
(00:50:24) Best of K Analysis
(01:02:26) Best of K Limitations
(01:09:04) Agent Interaction Modalities
(01:12:34) Analyzing Benchmark Results
(01:17:16) Model Performance Differences
(01:22:49) Elicitation and Scaffolding
(01:27:08) Context Window & Best of K
(01:35:17) Reward Hacking & Bad Behavior
(01:43:47) Future Directions & Hiring
(01:46:20) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
249 episodes
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis