Preface
Chapter 1: Introduction
  A brief history
  What is SRE?
  What is in the book?
  SRE as a framework for new projects
  Summary
  References
Chapter 2: Monitoring
  Why monitoring?
  Instrumenting an application
  What should we measure?
  A short introduction to SLIs, SLOs, and error budgets
  Service levels
  Error budgets
  Collecting and saving monitoring data
  Polling applications
  Nagios
  Prometheus
  Cacti
  Sensu
  Push applications
  StatsD
  Telegraf
  ELK
  Displaying monitoring information
  Arbitrary queries
  Graphs
  Dashboards
  Chatbots
  Managing and maintaining monitoring data
  Communicating about monitoring
  Do they even know there is monitoring?
  References and related reading
  Future reading
  Summary
Chapter 3: Incident Response
  What is an incident?
  What is incident response?
  Alerting
  When do you alert?
  How do you alert?
  Alerting services
  What is in an alert?
  Who do you alert?
  Being on call
  Communication
  Incident Command System (ICS)
  Where do you communicate?
  Recovering the system
  Calling all clear
  Summary
Chapter 4: Postmortems
  What is a postmortem?
  Why write a postmortem?
  When to write a postmortem document
  Carrying out incident analysis
  How to write a postmortem document
  Summary
  Impact
  Timeline
  Root cause
  Action items
  Postmortems without action items
  Appendix
  Blameless postmortems
  Holding a postmortem meeting
  Analyzing past postmortems
  MTTR and MTBF
  Alert fatigue
  Discussing past outages
  Summary
  References
Chapter 5: Testing and Releasing
  Testing
  What do you test?
  Testing code
  Testing infrastructure
  Testing processes
  Releasing
  When to release
  Releasing to production
  Validating your release
  Rollbacks
  Automation
  Continuous everything
  Summary
Chapter 6: Capacity Planning
  A quick introduction to business finance
  Why plan?
  Managing risk and managing expectations
  Defining a plan
  What is our current capacity?
  When are we going to run out of capacity?
  How should we change our capacity?
  State and concurrency
  Is your service limited by another service?
  Scaling for events
  Unpredictable growth: user-generated content
  Preplanned versus autoscaling
  Delivering
  Execute the plan
  Architecture: where performance changes come from
  Tech as a profit center and procurement
  Summary
Chapter 7: Building Tools
  Finding projects
  Defining projects
  RDD
  Example
  Design documents
  Planning projects
  Example
  Retrospectives and standups
  Allocation
  Building projects
  Advice for writing code
  Separation of concerns
  Long-term work
  Example OKRs
  Notebooks
  Documenting and maintaining projects
  Summary
Chapter 8: User Experience
  An introduction to design and UX
  Real-world interaction design
  User testing
  Picking an experience
  Designing the test
  Finding people to test
  Developer experience
  Experience of tools
  Performance budgets
  Security
  Authentication
  Authorization
  Risk profile
  Phishing
  ACM code of ethics
  Summary
  References
Chapter 9: Networking Foundations
  The internet
  Sending an HTTP request
  DNS
  dig
  Ethernet and TCP/IP
  Ethernet
  IP
  CIDR notation
  ICMP
  UDP
  TCP
  HTTP
  curl and wget
  Tools for watching the network
  netstat
  nc
  tcpdump
  Summary
Chapter 10: Linux and Cloud Foundations
  Linux fundamentals
  Everything is a file
  Files, directories, and inodes
  Sockets
  Devices
  /proc
  Filesystem layout
  What is a process?
  Zombies
  Orphans
  What is nice?
  syscalls
  How to trace
  Watching processes
  Build your own
  Cloud fundamentals
  VMs
  Containers
  Load balancing
  Autoscaling
  Storage
  Queues and Pub/Sub
  Units of scale
  Example architecture interview
  Summary
  References
Other Books You May Enjoy
Index