Collective Wisdom from the Experts
Emil Stolarsky, Jaime Woo

Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.
Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.
Table of Contents
Part I. New to SRE
Chapter 1. Site Reliability Engineering in Six Words
Chapter 2. Do We Know Why We Really Want Reliability?
Chapter 3. Building Self-Regulating Processes
Chapter 4. Four Engineers of an SRE Seder
Chapter 5. The Reliability Stack
Chapter 6. Infrastructure: It's Where the Power Is
Chapter 7. Thinking About Resilience
Chapter 8. Observability in the Development Cycle
Chapter 9. There Is No Magic
Chapter 10. How Wikipedia Is Served to You
Chapter 11. Why You Should Understand (a Little) About TCP
Chapter 12. The Importance of a Management Interface
Chapter 13. When It Comes to Storage, Think Distributed
Chapter 14. The Role of Cardinality
Chapter 15. Security Is like an Onion
Chapter 16. Use Your Words
Chapter 17. Where to SRE
Chapter 18. Dear Future Team
Chapter 19. Sustainability and Burnout
Chapter 20. Don't Take Advice from Graybeards
Chapter 21. Facing That First Page
Part II. Zero to One
Chapter 22. SRE, at Any Size, Is Cultural
Chapter 23. Everyone Is an SRE in a Small Organization
Chapter 24. Auditing Your Environment for Improvements
Chapter 25. With Incident Response, Start Small
Chapter 26. Solo SRE: Effecting Large-Scale Change as a Single Individual
Chapter 27. Design Goals for SLO Measurement
Chapter 28. I Have an Error Budget - Now What?
Chapter 29. How to Change Things
Chapter 30. Methodological Debugging
Chapter 31. How Startups Can Build an SRE Mindset
Chapter 32. Bootstrapping SRE in Enterprises
Chapter 33. It's Okay Not to Know, and It's Okay to Be Wrong
Chapter 34. Storytelling Is a Superpower
Chapter 35. Get Your Work Recognized: Write a Brag Document
Part III. One to Ten
Chapter 36. Making Work Visible
Chapter 37. An Overlooked Engineering Skill
Chapter 38. Unpacking the On-Call Divide
Chapter 39. The Maestros of Incident Response
Chapter 40. Effortless Incident Management
Chapter 41. If You're Doing Runbooks, Do Them Well
Chapter 42. Why I Hate Our Playbooks
Chapter 43. What Machines Do Well
Chapter 44. Integrating Empathy into SRE Tools
Chapter 45. Using ChatOps to Implement Empathy
Chapter 46. Move Fast to Unbreak Things
Chapter 47. You Don't Know for Sure Until It Runs in Production
Chapter 48. Sometimes the Fix Is the Problem
Chapter 49. Legendary
Chapter 50. Metrics Are Not SLIs (The Measure Everything Trap)
Chapter 51. When SLOs Attack: Pathological SLOs and How to Fix Them
Chapter 52. Holistic Approach to Product Reliability
Chapter 53. In Search of the Lost Time
Chapter 54. Unexpected Lessons from Office Hours
Chapter 55. Building Tools for Internal Customers that They Actually Want to Use
Chapter 56. It's About the Individuals and Interactions
Chapter 57. The Human Baseline in SRE
Chapter 58. Remotely Productive or Productively Remote
Chapter 59. Of Margins and Individuals
Chapter 60. The Importance of Margins in Systems
Chapter 61. Fewer Spreadsheets, More Napkins
Chapter 62. Sneaking in Your DevOps Deliciously
Chapter 63. Effecting SRE Cultural Changes in Enterprises
Chapter 64. To All the SREs I've Loved
Chapter 65. Complex: The Most Overloaded Word in Technology
Part IV. Ten to Hundred
Chapter 66. The Best Advice I Can Give to Teams
Chapter 67. Create Your Supporting Artifacts
Chapter 68. The Order of Operations for Getting SLO Buy-In
Chapter 69. Heroes Are Necessary, but Hero Culture Is Not
Chapter 70. On-Call Rotations that People Want to Join
Chapter 71. Study of Human Factors and Team Culture to Improve Pager Fatigue
Chapter 72. Optimize for MTTBTB (Mean Time to Back to Bed)
Chapter 73. Mitigating and Preventing Cascading Failures
Chapter 74. On-Call Health: The Metric You Could Be Measuring
Chapter 75. Helping Leaders Prioritize On-Call Health
Chapter 76. The SRE as a Diplomat
Chapter 77. The Forward-Deployed SRE
Chapter 78. Test Your Disaster Plan
Chapter 79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
Chapter 80. The Power of Uniformity
Chapter 81. Bytes per User Value
Chapter 82. Make Your Engineering Blog a Priority
Chapter 83. Don't Let Anyone Run Code in Your Context
Chapter 84. Trading Places: SRE and Product
Chapter 85. You See Teams, I See Product
Chapter 86. The Performance Emergency Fund
Chapter 87. Important but Not Urgent: Roadmaps for SREs
Part V. The Future of SRE
Chapter 88. That 50% Thing
Chapter 89. Following the Path of Safety-Critical Systems
Chapter 90. Applicable and Achievable Static Analysis
Chapter 91. The Importance of Formal Specification
Chapter 92. Risk and Rot in Sociotechnical Systems
Chapter 93. SRE in Crisis
Chapter 94. Expected Risk Limitations
Chapter 95. Beyond Local Risk: Accounting for Angry Birds
Chapter 96. A Word from Software Safety Nerds
Chapter 97. Incidents: A Window into Gaps
Chapter 98. The Third Age of SRE
Emil Stolarsky is a site reliability engineer who previously worked on caching, performance, and disaster recovery at Shopify and on the internal Kubernetes platform at DigitalOcean. He is the program co-chair for SREcon EMEA 2019 and SREcon Americas West 2020, and contributed a chapter to the O’Reilly book “Seeking SRE.”
Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He spent three years as a molecular biologist before working at DigitalOcean, Riot, and Shopify, where he launched the engineering communications function.









