How Google Runs Production Systems
Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?
In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient, lessons that are directly applicable to your organization.
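This book is divided into four main sections (Introduction, Principles, Practices, and Management), followed by a short concluding part and appendixes.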
This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you. (If there are other articles that support or inform the text, we reference them so you can follow up accordingly.)
You don’t need to read in any particular order, though we’d suggest at least starting with Chapters 2 and 3, which describe Google’s production environment and outline how SRE approaches risk, respectively. (Risk is, in many ways, the key quality of our profession.) Reading cover-to-cover is, of course, also useful and possible; our chapters are grouped thematically, into Principles (Part II), Practices (Part III), and Management (Part IV). Each has a small introduction that highlights what the individual pieces are about, and references other articles published by Google SREs, covering specific topics in more detail. Additionally, there’s a companion website mentioned in the book that has a number of helpful resources.
We hope this will be at least as useful and interesting to you as putting it together was for us.
— The Editors.
Table of Contents
Part I. Introduction
Chapter 1. Introduction
Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE
Part II. Principles
Chapter 3. Embracing Risk
Chapter 4. Service Level Objectives
Chapter 5. Eliminating Toil
Chapter 6. Monitoring Distributed Systems
Chapter 7. The Evolution of Automation at Google
Chapter 8. Release Engineering
Chapter 9. Simplicity
Part III. Practices
Chapter 10. Practical Alerting from Time-Series Data
Chapter 11. Being On-Call
Chapter 12. Effective Troubleshooting
Chapter 13. Emergency Response
Chapter 14. Managing Incidents
Chapter 15. Postmortem Culture: Learning from Failure
Chapter 16. Tracking Outages
Chapter 17. Testing for Reliability
Chapter 18. Software Engineering in SRE
Chapter 19. Load Balancing at the Frontend
Chapter 20. Load Balancing in the Datacenter
Chapter 21. Handling Overload
Chapter 22. Addressing Cascading Failures
Chapter 23. Managing Critical State: Distributed Consensus for Reliability
Chapter 24. Distributed Periodic Scheduling with Cron
Chapter 25. Data Processing Pipelines
Chapter 26. Data Integrity: What You Read Is What You Wrote
Chapter 27. Reliable Product Launches at Scale
Part IV. Management
Chapter 28. Accelerating SREs to On-Call and Beyond
Chapter 29. Dealing with Interrupts
Chapter 30. Embedding an SRE to Recover from Operational Overload
Chapter 31. Communication and Collaboration in SRE
Chapter 32. The Evolving SRE Engagement Model
Part V. Conclusions
Chapter 33. Lessons Learned from Other Industries
Chapter 34. Conclusion
Appendix A. Availability Table
Appendix B. A Collection of Best Practices for Production Services
Appendix C. Example Incident State Document
Appendix D. Example Postmortem
Appendix E. Launch Coordination Checklist
Appendix F. Example Production Meeting Minutes
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland’s peering hub. He is the author or coauthor of a number of technical papers and/or books, including "IPv6 Network Administration" for O’Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.
Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google’s advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He’s also a licensed professional engineer.
Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.