Artwork

Content provided by Philip - Host of AI Explained YT. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Philip - Host of AI Explained YT or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.
Player FM - Podcast App
Go offline with the Player FM app!

When Will AI Models Blackmail You, and Why?

26:19
 
Share
 

Manage episode 490586539 series 3611272
Content provided by Philip - Host of AI Explained YT. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Philip - Host of AI Explained YT or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.

In the last few days Anthropic have released an impressive honest account of how all models blackmail, no matter what goal they have, and despite prompt warnings, and other preventions. But do these models *want* this?
Thanks to Storyblocks for sponsoring this video! Download unlimited stock media at one set price with Storyblocks: storyblocks.com/AIExplained
AI Insiders ($9!): https://www.patreon.com/AIExplained
Chapters:
00:00 - Introduction
01:20 - What prompts blackmail?
02:44 - Blackmail walkthrough
06:04 - ‘American interests’
08:00 - Inherent desire?
10:45 - Switching Goals
11:35 - Murder
12:22 - Realizing it’s a scenario?
15:02 - Prompt engineering fix?
16:27 - Any fixes?
17:45 - Chekov’s Gun
19:25 - Job implications
21:19 - Bonus Details
Report: https://www.anthropic.com/research/agentic-misalignment
30 Page Appendices: https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Announcement: https://x.com/AnthropicAI/status/1936144602446082431?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Etweet
OpenAI Files: https://www.openaifiles.org/
Grok 4 News: https://x.com/RonFilipkowski/status/1936372579607912473
Claude 4 Report Card: https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
New Apollo Research: https://www.apolloresearch.ai/blog/more-capable-models-are-better-at-in-context-scheming
Interesting Reflections: https://nostalgebraist.tumblr.com/post/785766737747574784/the-void
Non-hype Newsletter: https://signaltonoise.beehiiv.com/

  continue reading

30 episodes

Artwork
iconShare
 
Manage episode 490586539 series 3611272
Content provided by Philip - Host of AI Explained YT. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Philip - Host of AI Explained YT or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.

In the last few days Anthropic have released an impressive honest account of how all models blackmail, no matter what goal they have, and despite prompt warnings, and other preventions. But do these models *want* this?
Thanks to Storyblocks for sponsoring this video! Download unlimited stock media at one set price with Storyblocks: storyblocks.com/AIExplained
AI Insiders ($9!): https://www.patreon.com/AIExplained
Chapters:
00:00 - Introduction
01:20 - What prompts blackmail?
02:44 - Blackmail walkthrough
06:04 - ‘American interests’
08:00 - Inherent desire?
10:45 - Switching Goals
11:35 - Murder
12:22 - Realizing it’s a scenario?
15:02 - Prompt engineering fix?
16:27 - Any fixes?
17:45 - Chekov’s Gun
19:25 - Job implications
21:19 - Bonus Details
Report: https://www.anthropic.com/research/agentic-misalignment
30 Page Appendices: https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Announcement: https://x.com/AnthropicAI/status/1936144602446082431?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Etweet
OpenAI Files: https://www.openaifiles.org/
Grok 4 News: https://x.com/RonFilipkowski/status/1936372579607912473
Claude 4 Report Card: https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
New Apollo Research: https://www.apolloresearch.ai/blog/more-capable-models-are-better-at-in-context-scheming
Interesting Reflections: https://nostalgebraist.tumblr.com/post/785766737747574784/the-void
Non-hype Newsletter: https://signaltonoise.beehiiv.com/

  continue reading

30 episodes

All episodes

×
 
Loading …

Welcome to Player FM!

Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.

 

Quick Reference Guide

Copyright 2025 | Privacy Policy | Terms of Service | | Copyright
Listen to this show while you explore
Play