
Content provided by Queue-it ApS. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Queue-it ApS or its podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://ppacc.player.fm/legal.

From Chaos to Reliability with Gremlin CEO Kolton Andrus

44:52
 

In this episode, Kolton Andrus, Founder and CEO of Gremlin, dives deep into all things chaos engineering and reliability testing. Kolton shares his journey from leading reliability efforts at Amazon and Netflix to founding Gremlin, an enterprise reliability platform. He and host José Quaresma discuss what it really takes to build resilient systems, the cultural shift required to prioritize reliability, and how Gremlin is working to reshape accountability in engineering teams. From testing dependencies to aligning incentives, this conversation is packed with real-world insights into scaling systems (and teams) that don't break under pressure.


---

Kolton Andrus is the CEO and founder of Gremlin. Previously, he focused on building and operating reliable systems at Netflix and Amazon. At both companies, he operated systems at scale, managed company-wide incidents, and helped build out their respective reliability programs and toolsets.

Host José Quaresma is the VP of Technical Engagement at Queue-it, working on the frontlines with some of the world’s biggest businesses on their busiest days, from Ticketmaster to Zalando to Home Office U.K. Each week, he’ll be joined by experts across industries, uncovering how major organizations design, build, and deploy systems that perform at scale.

This podcast is hosted by José Quaresma, researched by Joseph Thwaites and produced by Perseu Mandillo.

  • (00:00) - Intro & Guest: Kolton Andrus
  • (04:20) - Founding Gremlin (2016)
  • (08:47) - Rewarding Invisible Reliability Work
  • (12:27) - Proving Reliability’s Business Value
  • (15:21) - Rethinking the “Chaos Engineering” Label
  • (20:18) - Chaos Testing to Reliability Scores
  • (24:25) - Spreading Reliability Culture Across Teams
  • (28:50) - Safe, Incremental Failure Testing in Prod
  • (33:30) - Load + Fault Testing for Peak Traffic
  • (36:30) - AI’s Opportunities & Risks for Ops
  • (39:30) - Defining Scalability as Elasticity
  • (44:18) - Key Takeaways & Farewell

© Queue-it, 2025

7 episodes

