Avoiding “The 5 Stages of Networking Grief” during CSP Outages

The internet and its SaaS platforms have become so routine, so predictably reliable, that it is almost impossible to believe an outage will occur until it actually happens. Cloud Service Providers (CSPs), with their multi-region, multi-availability-zone redundancy and auto-everything, appear – on their abstracted surface – almost impervious to failure. Our growing comfort with the public cloud has, in some ways, created a normalcy bias that catches us off-guard and leaves us scrambling to respond. Yet despite our faith in the public cloud, no individual, process, or device has shown itself to be infallible.

Your network or the cloud, the pain is the same

While it was many years ago, I still vividly remember making a firewall change on a device that was located about 45 minutes from anything. The local technician had already departed after powering everything up and ensuring I had an accessible IP address. Sometime thereafter I fat-fingered a command that resulted in a dead CLI prompt. That’s when I learned about the Five Stages of Network Grief: you go from denial – while funneling anger, bargaining, and depression into the act of furiously slamming the “Enter” key – to acceptance where you pick up the phone to find someone nearby with a blue cable. Human error or not, I’m sure multiple people – AWS and their subscribers – were furiously slamming the “Enter” key this week trying to get the network back online.

While we now have some indication of root cause, no one waited for it before starting the finger pointing, and the marketing ambulance chasers were quick to claim that if you had been on their network or using their platform, you wouldn’t have had an outage. Of course, there’s no fine print stating, “you wouldn’t have had an outage today.” As Chuck Palahniuk wrote in Fight Club, “On a long enough timeline, the survival rate for everyone drops to zero.” Everyone takes an outage at some point.

You don’t just move on, resiliency is needed

This does not mean that we simply chalk this up as “routine” and move on, as suggested here by Spencer Soper at Bloomberg, nor does it mean saying “multi-cloud” five times in front of a mirror. Despite the feel-good redundancies built into a CSP’s infrastructure, it is still our responsibility to architect resiliency into the services we deploy – on-prem or off. Reducing blast radius is why many companies now have, or are developing, a multi-cloud strategy.

My time in network security engineering and operations has me believing – experience and bias both – that the majority of the multi-cloud fault-tolerance burden lands on the network team. I can log in to any cloud and start spinning up server after server with a series of mouse clicks. Ask me to provision highly available and secure transport to those resources, and I’m going to need more time. A lot more. There’s no “auto-scale” tick box next to the network config. While compute infrastructure has received numerous upgrades and abstractions over the past 20 years, the network has not evolved much. There are new protocols, but it is still difficult to escape the lock-in of the OSI model. Developing and executing a highly available, resilient network strategy across multiple clouds adds complexity to an already complex problem – the multiple protocol layers create blind spots that often expose themselves only when you are trying to get your applications back online.

Why can’t we automate our way out of an outage?

But what if you did have an upgraded network on a day like this? What if that network not only identified the outage in AWS but also automatically steered traffic to the next best region or cloud, where you had built in fault tolerance for exactly this kind of failure? Applications would continue to function, and function well, despite the loss of a single region. I’m mindful that I just called out the outage ambulance chasers – I’m not here to tell you that Prosimo would have prevented you from taking an outage in AWS. It’s still on you to build appropriate fault tolerance into your architecture (Matt Asay has a great write-up on this at TechRepublic here). What I will say is that if you did architect with fault tolerance in mind, Prosimo would have alerted you to the degradation of network availability in AWS US East while your network dynamically adapted to keep your applications available globally.
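The steering behavior described above boils down to a health-check-and-failover loop: probe each region in priority order and send traffic to the first healthy one. Here is a minimal sketch in plain Python – the region names, URLs, and `probe` hook are entirely hypothetical illustrations of the general pattern, not Prosimo’s actual API:

```python
import urllib.request

# Hypothetical region endpoints in priority order -- names and URLs
# are illustrative only.
REGIONS = [
    ("aws-us-east-1", "https://app.us-east-1.example.com/healthz"),
    ("aws-us-west-2", "https://app.us-west-2.example.com/healthz"),
    ("azure-eastus",  "https://app.azure-east.example.com/healthz"),
]

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat any HTTP 200 within the timeout as 'up'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_region(regions, probe=healthy):
    """Return the first region whose health check passes, in priority order."""
    for name, url in regions:
        if probe(url):
            return name
    return None  # every region failed its check: a total outage
```

In a real deployment the steering would happen at the DNS or routing layer rather than in application code, and the probes would run continuously instead of per-request – but the decision logic is this simple at its core.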

How do you architect for fault tolerance in the cloud?

Prosimo changes conventional network paradigms by deploying an autonomous single- or multi-cloud network fabric in minutes. You can then deterministically alter that fabric – manually or using our ML-generated suggestions – within minutes. Not hours, not days. Need to provision a new region? Need to tear one down? The use case doesn’t matter; everything can be done with a few mouse clicks in a few minutes.

It would be great if you could stop there, but connectivity alone isn’t enough: you also need to secure your infrastructure and protect your business. That means granular, zero-trust access policies and adaptive security that monitors every session for anomalies and risk factors from setup to completion. With just a few clicks, Prosimo starts delivering immediately. Need more (or less) application performance? Balance cost against performance using best-path or best-price routing across the multi-cloud networking (MCN) fabric. You can also leverage our pre-built caching policies to dynamically bring content, within any cloud, to the cloud edge closest to your users. The network now operates at the speed of your DevOps compute and services infrastructure.

Cloud operations can be different...

That’s a great plug for Day 0, but let’s get back to the Day N+1 issue of a CSP outage. Prosimo delivers rich network telemetry to the cloud, within the cloud, and across clouds – including application response time. This lets you differentiate between network latency and jitter on the one hand and an unresponsive application on the other. No more sweating it out over a CLI, running TCP dumps on multiple virtual appliances and then analyzing them in Wireshark – an approach that is painful and costs you both time and money. We need easily consumable data that delivers both understanding and a timely response, and we shouldn’t need multiple tools to get it: pulling multiple operations teams onto a call to check all of their in-path virtual devices during an outage window will never be efficient. Prosimo empowers NOC and SOC teams to pinpoint problems while your autonomous cloud network adapts and overcomes them. This is no different from a virtual machine set that auto-scales based on load, except it is now possible and available from the network.
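The network-versus-application distinction above can be illustrated with two simple measurements: how long the TCP handshake takes (a proxy for the network path) versus how long the full request takes (network plus application work). A rough sketch, assuming made-up illustrative thresholds – real telemetry would use continuous baselines rather than fixed numbers:

```python
import socket
import time
import urllib.request

def connect_time_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """TCP handshake time: a rough proxy for network latency to the host."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

def response_time_ms(url: str, timeout: float = 5.0) -> float:
    """Full request/response time: network latency plus application work."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

def classify(connect_ms: float, response_ms: float,
             net_threshold: float = 200.0, app_threshold: float = 1000.0) -> str:
    """Crude triage: is the problem the network path or the application?"""
    if connect_ms > net_threshold:
        return "network"       # slow even to open a socket
    if response_ms - connect_ms > app_threshold:
        return "application"   # path is fine; the app is slow to answer
    return "healthy"
```

If the socket opens quickly but the response is slow, the path is likely fine and the application is the bottleneck – which is exactly the signal that saves you the packet-capture session.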

This is what we mean by autonomous multi-cloud networking – you gain autonomy from traditional network paradigms that create drag on business priorities. With Prosimo you can stop playing the operational blame game and start delivering better time to market with a lower operational overhead. This is, after all, what the business truly wants from its IT teams.

You inherit the “shared responsibility” model as soon as you sign on to the public cloud, yet when an outage occurs, you realize very quickly how much of the fault-tolerance responsibility falls on your shoulders. Your customers – internal and external – don’t care that only a single AWS region is down while the rest of the internet operates normally. You may be able to do this on your own, but I recommend you do everything you can to avoid the 5 Stages of Network Grief. Whether it’s Prosimo or someone else, it doesn’t hurt to accept a little help when it comes to cloud networking – you won’t have the option of calling someone with a blue cable when things go wrong.