Note: The following is a recount of an incident that occurred during my time as an SRE with a previous company. I do not have access to the original outage report and I am sharing this story based on my recollection. While I may not remember every detail with complete accuracy, the general sequence of events is described here to the best of my ability. Any errors or omissions are entirely my own. Also, for the sake of maintaining confidentiality, names of the company, product, tool, or any other identifiable information will not be included.

Ah, on-call rotations. They keep you sharp, but sometimes they test your patience in the most unexpected ways. During an on-call shift with a previous employer, a seemingly minor configuration change cascaded into a major incident, taking down an entire production cluster. Here’s the story, along with the valuable takeaways that helped us prevent similar situations in the future.

Tier 1 Networking and Missing Alerts

The incident stemmed from a change deployed to our Google Cloud Platform (GCP) cluster. This cluster, built specifically for a new product, was set to use Tier 1 networking. Tier 1 networking, while cost-effective, comes with limitations on availability compared to the Premium Tier we typically used for production workloads.

The change itself went smoothly, with no errors during the merge process. However, a crucial detail was overlooked: our account wasn’t configured to utilize Tier 1 networking. This mismatch remained undetected as our infrastructure-as-code (IaC) continued to spin up instances.
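
In hindsight, even a lightweight preflight check run before the IaC rollout could have surfaced a mismatch like this. The sketch below is purely illustrative: the project, zone, and machine type names are placeholders, and the real validation in our case would have needed to cover the account-level Tier 1 settings rather than just machine type availability. Still, it shows the general idea of confirming that what you are about to request actually exists in the target zone.

```python
# Hypothetical preflight check: verify the machine type our IaC is about to
# request is actually offered in the target zone before recycling any nodes.
# Project, zone, and machine type are illustrative placeholders.
from google.api_core.exceptions import NotFound
from google.cloud import compute_v1

PROJECT = "my-gcp-project"      # placeholder
ZONE = "us-central1-a"          # placeholder
MACHINE_TYPE = "n2-standard-8"  # placeholder for the type the change requested


def machine_type_available(project: str, zone: str, machine_type: str) -> bool:
    """Return True if the machine type is offered in the given zone."""
    client = compute_v1.MachineTypesClient()
    try:
        client.get(project=project, zone=zone, machine_type=machine_type)
        return True
    except NotFound:
        return False


if __name__ == "__main__":
    if not machine_type_available(PROJECT, ZONE, MACHINE_TYPE):
        raise SystemExit(
            f"{MACHINE_TYPE} is not available in {ZONE}; aborting rollout."
        )
    print("Preflight check passed.")
```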

The consequence? As existing instances in the cluster recycled over the next ten days, attempts to launch replacements failed. GCP, adhering to the Tier 1 configuration, couldn’t provision new instances. This left the entire cluster stranded, with no pods running and services unavailable.

Alert Fatigue and the Importance of Visibility

What truly surprised us was the lack of proper alerting during this critical downtime. While an alert did fire when individual pods failed, it got buried under the usual barrage of notifications from other services. This “alert fatigue” prevented a timely investigation.

The situation only came to light when a product owner flagged the issue to my manager. When I investigated the cluster, I found no available nodes, along with error messages pointing to the Tier 1 networking incompatibility.
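
For the curious, that first pass over the cluster amounted to something like the sketch below, using the official Kubernetes Python client. The event reason strings it filters on are illustrative, not an exhaustive list; in our case the messages ultimately pointed at the Tier 1 networking incompatibility.

```python
# Rough sketch of the initial triage: how many nodes are Ready, and what do
# recent cluster events say about failed provisioning or scheduling?
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

nodes = v1.list_node().items
ready = [
    n.metadata.name
    for n in nodes
    for c in (n.status.conditions or [])
    if c.type == "Ready" and c.status == "True"
]
print(f"{len(ready)} of {len(nodes)} nodes Ready")

# Surface recent events that hint at provisioning or scheduling failures.
# These reason strings are illustrative placeholders.
suspect_reasons = {"FailedScheduling", "FailedCreate", "NodeNotReady"}
for event in v1.list_event_for_all_namespaces().items:
    if event.reason in suspect_reasons:
        print(f"{event.last_timestamp} {event.reason}: {event.message}")
```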

Resolving the Crisis and Learning from Our Mistakes

By retracing our steps through the Git repository, we pinpointed the problematic change and promptly reverted it. With Premium Tier networking back in play, a new node pool successfully launched, bringing the cluster and its services back online.

Lessons Learned: The Power of Proactive Measures

This incident underscored the importance of several key practices:

  1. Comprehensive Monitoring: We needed a more robust monitoring system that wouldn’t let critical infrastructure failures slip through the cracks. Detecting a large number of downed instances within the cluster should have triggered an immediate alert (a sketch of such a check follows this list).
  2. Rigorous Testing: For changes impacting instance configuration, incorporating tests to verify the availability of specific instance types or network classes would have prevented this issue from reaching production.
  3. Alert Optimization: We needed to address “alert fatigue” by fine-tuning our alerting system to prioritize critical notifications and differentiate between minor hiccups and potential outages.
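
To make the first point concrete, here is a hypothetical version of the kind of scheduled check that would have caught this outage, again using the Kubernetes Python client. The thresholds and the page() function are placeholders for whatever paging integration is in use; the point is that a cluster with almost no Ready nodes should page someone directly instead of drowning in the general notification stream.

```python
# Hypothetical scheduled check: page if the number of Ready nodes drops below
# a floor, or if a large share of pods are stuck Pending. Thresholds and the
# page() hook are placeholders for a real paging integration.
from kubernetes import client, config

MIN_READY_NODES = 3       # placeholder floor for this cluster
MAX_PENDING_RATIO = 0.5   # placeholder: more than half the pods Pending is bad


def page(message: str) -> None:
    """Placeholder for a real paging integration (PagerDuty, Opsgenie, etc.)."""
    print(f"PAGE: {message}")


def check_cluster_health() -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()

    nodes = v1.list_node().items
    ready_nodes = [
        n for n in nodes
        if any(c.type == "Ready" and c.status == "True"
               for c in (n.status.conditions or []))
    ]
    if len(ready_nodes) < MIN_READY_NODES:
        page(f"Only {len(ready_nodes)} Ready nodes (expected >= {MIN_READY_NODES})")

    pods = v1.list_pod_for_all_namespaces().items
    pending = [p for p in pods if p.status.phase == "Pending"]
    if pods and len(pending) / len(pods) > MAX_PENDING_RATIO:
        page(f"{len(pending)} of {len(pods)} pods are Pending")


if __name__ == "__main__":
    check_cluster_health()
```

Routing that page through a dedicated high-severity channel also speaks to the third point: a cluster-wide provisioning failure should never have to compete for attention with routine per-pod noise.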

By putting these lessons into practice, we significantly improved our ability to detect and address potential issues before they could snowball into major incidents. This experience, while stressful at the time, proved to be a valuable lesson in the importance of robust monitoring, thorough testing, and a well-managed alerting strategy.