
“Mitigating Risks from Rouge AI” by Stephen Clare

9:50

Introduction

Misaligned AI systems, which have a tendency to use their capabilities in ways that conflict with the intentions of both developers and users, could cause significant societal harm. Identifying them is seen as increasingly important to inform development and deployment decisions and design mitigation measures. There are concerns, however, that this will prove challenging. For example, misaligned AIs may only reveal harmful behaviors in rare circumstances, or perceive detection attempts as threatening and deploy countermeasures – including deception and sandbagging – to evade them.

For these reasons, a range of methods for detecting misaligned behaviors such as power-seeking, deception, and sandbagging has been proposed. One important indicator, though, has been hiding in plain sight for years. In this post, we identify an underappreciated method that may be both necessary and sufficient to identify misaligned AIs: whether or not they've turned red, i.e. gone rouge.

In [...]
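
The excerpt cuts off before the post explains its detection procedure, so nothing below comes from the article itself. Purely as an illustrative sketch of the kind of redness check the title implies, one might imagine flagging a system whose indicator light's RGB reading is dominated by the red channel; the function name and threshold here are invented for the example.

```python
# Hypothetical sketch only: a toy "has it gone rouge?" check, not the post's actual eval.

def has_gone_rouge(rgb, red_margin=40):
    """Return True if the red channel clearly dominates green and blue.

    `rgb` is an (R, G, B) tuple of 0-255 ints; `red_margin` is an
    arbitrary threshold chosen for illustration.
    """
    r, g, b = rgb
    return r - max(g, b) >= red_margin

# Example readings (values invented for illustration):
print(has_gone_rouge((220, 30, 30)))   # True  -> the system has gone rouge
print(has_gone_rouge((90, 160, 230)))  # False -> still comfortably blue
```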

---

Outline:

(01:43) Historical Evidence for Rouge AI

(02:59) Recent Empirical Work

(05:18) Potential Countermeasure

(05:22) The EYES Eval

(06:27) EYES Eval Demonstration

(07:40) Future Research Directions

(08:42) Conclusion

---

First published:
April 1st, 2025

Source:
https://forum.effectivealtruism.org/posts/uKKoj9iqj2cWKsjrt/mitigating-risks-from-rouge-ai

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Metallic robot with glowing red eyes in dark industrial setting.
Humanoid robot with glowing red chest standing in arched doorway.
Cartoon robots and computers showing
HAL 9000's glowing red eye lens from 2001: A Space Odyssey.
Diagram showing five phases of AI security and deployment testing workflow. The image illustrates a comprehensive security framework with color-coded teams (blue, red, developer) working through control measures, attack strategies, model testing, and deployment monitoring. Each phase shows specific roles and interactions between different components of the security system, with a notable
Diagram showing AI response differences between free-tier and paid-tier users. The image illustrates a comparison of how an AI system (
Color picker interface showing orange-coral asterisk shape with RGB values.
Gemini logo in blue and purple gradient with sparkle accent.
White floating robot with blue glowing eyes beside color picker panel.
HAL 9000's red camera eye with RGB color adjustment panel.
Black background with red Blossom logo and text about color restriction.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
