#051 Build systems that can be debugged at 4am by tired humans with no context
Nicolay here,
Today I have the chance to talk to Charity Majors, CEO and co-founder of Honeycomb, who has recently been writing about the cost crisis in observability.
"Your source of truth is production, not your IDE - and if you can't understand your code there, you're flying blind."
The key insight is architecturally simple but operationally transformative: replace your 10-20 observability tools with wide structured events that capture everything about a request in one place. Most teams store the same request data across metrics, logs, traces, APM, and error tracking - creating a 20X cost multiplier while making debugging nearly impossible because you're reconstructing stories from fragments.
Charity's approach flips this: instrument once with rich context, derive everything else from that single source. This isn't just about cost - it's about giving engineers the connective tissue to understand distributed systems. When you can correlate "all requests failing from Android version X in region Y using language pack Z," you find problems in minutes instead of days.
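Not from the episode itself, but a minimal Python sketch of what "instrument once with rich context" can look like: one wide event per unit of work, emitted as a single JSON blob at the end of the request. All field and service names here are illustrative, not Honeycomb's actual schema.

```python
# Sketch of a "wide structured event": instead of scattering context
# across many log lines, collect everything about one request into a
# single dict and emit it once when the unit of work finishes.
import json
import time

def handle_request(user_id, android_version, region, language_pack):
    # Every dimension you might later want to correlate on goes in here.
    event = {
        "timestamp": time.time(),
        "service": "checkout",            # illustrative service name
        "user_id": user_id,
        "android_version": android_version,
        "region": region,
        "language_pack": language_pack,
    }
    start = time.monotonic()
    try:
        # ... the actual request handling would happen here ...
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        print(json.dumps(event))          # ship to your event store
    return event

handle_request("u42", "14", "eu-west", "de")
```

Because every dimension lives on the same event, queries like "failures by Android version in region Y" become simple group-bys rather than cross-tool joins.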
The second big lever is putting developers on call for their own code. This creates the tight feedback loop that makes engineers write more reliable software - nobody wants to get paged at 3am for their own bugs.
In the podcast, we also touch on:
- Why deploy time is the foundational feedback loop (15 minutes vs 15 hours changes everything)
- The controversial "developers on call" stance and why ops people rarely found companies
- How microservices made everything trace-shaped and killed traditional metrics approaches
- The "normal engineer" philosophy - building for 4am debugging, not peak performance
- AI making "code of unknown quality" the new normal
- Progressive deployment strategies (kibble → dogfood → production)
- and more
💡 Core Concepts
- Wide Structured Events: Capturing all request context in one instrumentation event instead of scattered log lines - enables correlation analysis that's impossible with fragmented data.
- Observability 2.0: Moving from metrics-as-workhorse to structured-data-as-workhorse, where you instrument once and derive metrics/alerts/dashboards from the same rich dataset.
- SLO-based Alerting: Replacing symptom alerts (CPU, memory, disk) with customer-impact alerts that measure whether you're meeting promises to users.
- Progressive Deployment: Gradual rollout through staged environments (kibble → dogfood → production) that builds confidence without requiring 2X infrastructure.
- Trace-shaped Systems: Architecture pattern recognizing that distributed systems problems are fundamentally about correlating events across time and services, not isolated metrics.
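To make the SLO-based alerting concept above concrete, here is a hypothetical sketch (not from the episode): alert on whether you are keeping your promise to users over a window of requests, rather than on CPU or memory symptoms. The 99.9% target is an assumed example.

```python
# SLO-style check: did we meet the success-rate promise over this window?
def slo_breached(events, target=0.999):
    """Return True if the success rate falls below the SLO target."""
    if not events:
        return False                      # no traffic, nothing to alert on
    good = sum(1 for e in events if e["status"] == "ok")
    return good / len(events) < target

# 2 failures in 1000 requests -> 99.8% success, below a 99.9% target
window = [{"status": "ok"}] * 998 + [{"status": "error"}] * 2
```

A single check like this can replace dozens of symptom alerts: if customers are getting what was promised, a noisy CPU graph does not page anyone.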
📶 Connect with Charity:
📶 Connect with Nicolay:
⏱️ Important Moments
- Gateway Drug to Engineering: [01:04] How IRC and bash tab completion sparked Charity's fascination with Unix command line possibilities
- ADHD and Incident Response: [01:54] Why high-pressure outages brought out her best work - getting "dead calm" when everything's broken
- Code vs. Production Reality: [02:56] Evolution from focusing on code beauty to understanding performance, behavior, and maintenance over time
- The Alexander's Horse Principle: [04:49] Auto-deployment as daily practice - if you grow up deploying constantly, it feels natural by the time you scale
- Production as Source of Truth: [06:32] Why your IDE output doesn't matter if you can't understand your code's intersection with infrastructure and users
- The Logging Evolution: [08:03] Moving from debugger-style spam logs to fewer, wider structured events oriented around units of work
- Bubble Up Anomaly Detection: [10:27] How correlating dimensions reveals that failures cluster around specific Android versions, regions, and feature combinations
- Everything is Trace-Shaped: [12:45] Why microservices complexity is about locating problems in distributed systems, not just identifying them
- AI as Acceleration of Automation: [15:57] Swap "AI" for "automation" in most panicked takes and they still hold - it's the same pattern, just with faster feedback loops
- Non-determinism as Genuinely New: [16:51] The one aspect of AI that's actually novel in software systems, requiring new architectural patterns
- The Cost Crisis: [22:30] How 10-20 observability tools create unsustainable cost multipliers as businesses scale
- The Instrumentation Habit: [23:15] Always looking at your code in production after deployment to build informed instincts about system behavior
- SLO Revolution: [28:40] Deleting 90% of alerts by focusing on customer impact instead of system symptoms
- Shrinking Feedback Loops: [34:28] Keeping deploy-to-validation under one hour so engineers can connect actions to outcomes
- Progressive Deployment Strategy: [36:43] Kibble → dogfood → production pipeline for gradual confidence building
- Normal Engineer Design: [38:12] Building systems that work for tired humans at 4am, not just heroes during business hours
- Real Engineering Bar: [49:00] Discussion on what actually makes exceptional vs normal engineers
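The "bubble up" moment above describes correlating dimensions to find where failures cluster. A hypothetical sketch of the core idea (field names are illustrative, and real implementations scan many dimensions at once):

```python
# Toy "bubble up": rank values of one dimension by failure rate, so the
# value most correlated with errors rises to the top.
from collections import Counter

def bubble_up(events, dimension):
    failures = Counter(e[dimension] for e in events if e["status"] != "ok")
    totals = Counter(e[dimension] for e in events)
    rates = {value: failures.get(value, 0) / n for value, n in totals.items()}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

events = (
    [{"status": "error", "android_version": "14"}] * 9
    + [{"status": "ok", "android_version": "14"}] * 1
    + [{"status": "ok", "android_version": "13"}] * 90
)
# Android 14 bubbles to the top with a 90% failure rate
```

This only works if the failing events carry the dimensions in the first place - which is the argument for wide structured events.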
🛠️ Tools & Tech Mentioned
- Honeycomb - Observability platform for structured events
- OpenTelemetry - Vendor-neutral instrumentation framework
- IRC - Early gateway to computing
- Parse - Mobile backend where Honeycomb's origin story began
📚 Recommended Resources
- "In Praise of Normal Engineers" - Charity's blog post
- "How I Failed" by Tim O'Reilly
- "Looking at the Crux" by Richard Rumelt
- "Fluke" by Brian Klaas - Book about randomness in history
- "Engineering Management for the Rest of Us" by Sarah Drasner