41 - Lee Sharkey on Attribution-based Parameter Decomposition
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html
Topics we discuss, and timestamps:
0:00:41 APD basics
0:07:57 Faithfulness
0:11:10 Minimality
0:28:44 Simplicity
0:34:50 Concrete-ish examples of APD
0:52:00 Which parts of APD are canonical
0:58:10 Hyperparameter selection
1:06:40 APD in toy models of superposition
1:14:40 APD and compressed computation
1:25:43 Mechanisms vs representations
1:34:41 Future applications of APD?
1:44:19 How costly is APD?
1:49:14 More on minimality training
1:51:49 Follow-up work
2:05:24 APD on giant chain-of-thought models?
2:11:27 APD and "features"
2:14:11 Following Lee's work
Lee links (Leenks):
X/Twitter: https://twitter.com/leedsharkey
Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey
Research we discuss:
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476
Feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis
Episode art by Hamish Doodles: hamishdoodles.com