Mapping the Mind of a Neural Net: Goodfire’s Eric Ho on the Future of Interpretability
Eric Ho is building Goodfire to solve one of AI’s most critical challenges: understanding what’s actually happening inside neural networks. His team is developing techniques to understand, audit and edit neural networks at the feature level. Eric discusses breakthrough results in resolving superposition through sparse autoencoders, successful model editing demonstrations and real-world applications in genomics with Arc Institute's DNA foundation models. He argues that interpretability will be critical as AI systems become more powerful and take on mission-critical roles in society.
Hosted by Sonya Huang and Roelof Botha, Sequoia Capital
Mentioned in this episode:
Mech interp: Mechanistic interpretability, list of important papers here
Phineas Gage: 19th-century American railroad construction foreman who survived an accident that destroyed much of his brain’s left frontal lobe; became a famous case study in neuroscience.
Human Genome Project: International effort from 1990 to 2003 to generate the first sequence of the human genome, which accelerated the study of human biology
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Zoom In: An Introduction to Circuits: Influential 2020 paper from OpenAI that introduced the circuits approach to mechanistic interpretability
Superposition: Concept borrowed from physics; the phenomenon by which neural networks represent more concepts than they have neurons, in effect simulating a larger network
Apollo Research: AI safety company that designs AI model evaluations and conducts interpretability research
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023 Anthropic paper that uses a sparse autoencoder to extract interpretable features; followed by Scaling Monosemanticity
Under the Hood of a Reasoning Model: 2025 Goodfire paper that interprets DeepSeek’s reasoning model R1
Auto-interpretability: Using LLMs to automatically write explanations for the behavior of neurons and features in other LLMs
Interpreting Evo 2: Arc Institute's Next-Generation Genomic Foundation Model (see the episode with Arc co-founder Patrick Hsu)
Paint with Ember: Canvas interface from Goodfire that lets you steer an LLM’s visual output in real time (paper here)
Model diffing: Interpreting how a model differs from checkpoint to checkpoint during finetuning
Feature steering: The ability to change the style of LLM output by up- or down-weighting features (e.g. talking like a pirate vs. giving factual information about the Andromeda Galaxy); see the sketch after this list
Weight-based interpretability: Method for directly decomposing a neural network’s parameters into mechanistic components, rather than working from activation features
The Urgency of Interpretability: Essay by Anthropic co-founder and CEO Dario Amodei
On the Biology of a Large Language Model: Goodfire collaboration with Anthropic
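To make the sparse-autoencoder and feature-steering items above concrete, here is a minimal, hypothetical sketch (not Goodfire’s Ember API or any published implementation): a toy sparse autoencoder that decomposes a model’s residual-stream activations into an overcomplete set of features, plus a helper that up- or down-weights a single feature before decoding back, which is the core move behind feature steering. The dimensions, the feature index, and all names are assumptions, and the training loop is omitted.

```python
"""Toy sketch of SAE-style feature extraction and feature steering.
All sizes and names are hypothetical; a real SAE would be trained on
activations collected from an actual model."""
import torch
import torch.nn as nn

D_MODEL = 512      # width of the (hypothetical) model's residual stream
N_FEATURES = 4096  # overcomplete dictionary: many more features than neurons


class SparseAutoencoder(nn.Module):
    """Dictionary-learning style SAE: encode to sparse features, decode back."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative and (after training) sparse.
        return torch.relu(self.encoder(activations))

    def decode(self, features: torch.Tensor) -> torch.Tensor:
        return self.decoder(features)


@torch.no_grad()
def steer(sae: SparseAutoencoder, activations: torch.Tensor,
          feature_idx: int, scale: float) -> torch.Tensor:
    """Feature steering: up- or down-weight one interpretable feature,
    then map the edited features back into activation space."""
    features = sae.encode(activations)
    features[..., feature_idx] *= scale  # e.g. scale > 1 boosts a "pirate talk" feature
    return sae.decode(features)


if __name__ == "__main__":
    sae = SparseAutoencoder(D_MODEL, N_FEATURES)  # in practice: load trained weights
    acts = torch.randn(1, D_MODEL)                # stand-in for a residual-stream vector
    PIRATE_FEATURE = 1234                         # hypothetical feature index
    steered = steer(sae, acts, PIRATE_FEATURE, scale=4.0)
    print(steered.shape)  # torch.Size([1, 512]): edited activations to patch back into the model
```

In practice the autoencoder is trained to reconstruct activations with an L1 sparsity penalty on the feature activations, and the steered activations are patched back into the model’s forward pass to change its output.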