
Thinking about Disaster Recovery in the cloud? Don’t forget about the network!

Disaster Recovery (DR) is integral to an organization’s business continuity plans. For those with applications in the cloud, implementing an effective DR strategy is a critical part of the organization’s overall risk management plan. When organizations think about DR, they primarily focus on ensuring that corporate data is secure and replicated across regions in the cloud while remaining compliant. However, when the network that gives users access to that data underperforms or fails, the organization’s overall operations are affected.

The right way to think about this is to decouple the application from network infrastructure and focus on building a highly available and fault-tolerant cloud network transit. It is essential to consider some items:

True cloud-native architecture:

  • Many cloud network solutions claim to be cloud-native but are designed using traditional data center technologies with virtual appliances.
  • Some cloud network solutions are delivered as-a-service (mid-mile offering) that is suited for small-scale deployments. Medium to large enterprises need control over how the DR requirements are implemented as the network scales or complexity grows.
  • A proper cloud-native network solution is built using modern application architectures and can dynamically and cost-effectively meet network scaling requirements. Even more important, it gives enterprises complete control, so they can maintain data sovereignty and ensure the network meets its DR needs.

 

Cloud Transit’s fault-tolerance:

  • The cloud network solution should handle disruptions in the cloud seamlessly. From flapping network links to outages in cloud regions, the cloud network transit should always be available to deliver traffic to the application regardless of location.

 

Consistent user experience:

  • Application locations will change in the cloud following the execution of a DR plan; the cloud network should be able to deliver a consistent user experience regardless of disruptions within the network or where the application is.

A transit that understands application patterns:

  • A transit that understands application behavior, the layer at which the application is connected (L3, L4, or L7), and the impact of an outage. The insights about the application help it build layers of fault tolerance.

 

Let’s explore how Prosimo’s autonomous cloud network platform helps organizations build a highly available network that meets the requirements above. The platform uses a combination of proactive monitoring through heartbeat messages and DNS programming, and we discuss them in detail below.

Reader Note

Prosimo Edge gateways are highly available network stacks that operate as an “active-active” element within the Prosimo fabric. For clarity, we have termed two Prosimo Edges “primary” and “secondary” when illustrating routing after an application failure. This should not imply that a Prosimo Edge sits dormant until a failure occurs.

Scenario

Acme, Inc. deployed a business-critical application, app.acme.com, in the US-West region. As part of their DR plan, another set of instances running the same application is deployed in US-East. Employees who access this application are based in the Philippines. The application is also accessed by other applications, app2.acme.com and app3.acme.com, which fetch data used for analytics within Acme. Acme deployed the Prosimo cloud network platform within their cloud environment using the architecture below.
Figure 1: Acme's cloud network architecture using Prosimo

Architecture details:

  • The application is deployed in instances across AWS US-West and US-East regions.
    • Instances in the US-West region serve as the active application endpoint while US-East is the backup.
  • Prosimo Edge gateways have also been deployed in AWS US-West and US-East regions and connect to instances deployed in the respective regions through an AWS Transit Gateway (TGW).
    • The Edge gateways serve as primary and secondary Edges for the application in US-West and US-East regions.
    • The Edge gateways establish cross-regional cloud-native peering to connect US regions. Acme leverages this connectivity to replicate data across the instances.
  • Acme also deployed 2 Prosimo Edge gateways in the AWS AP-Southeast region to provide quick access to the cloud network for their employees.
    • One Edge gateway is deployed in each of the AP-Southeast-1 and AP-Southeast-3 regions.
  • Prosimo Edge gateways create a fabric by establishing cloud-native peering over the AWS backbone.
    • Edges are configured with information on how to reach the application instances deployed in their respective regions.
    • Secondary Edges are configured with information on how to reach primary Edges.
  • Prosimo Edges contain reverse-proxy capabilities, sitting in the data path between users and applications.

Prosimo’s autonomous cloud-network platform addresses Acme’s DR needs within the network in 2 ways:

  • Edge gateway resiliency
  • Platform resiliency

Edge gateway resiliency

The Edge gateways serve as gatekeepers and sit in the path between the employees and applications. It’s critical that they meet Acme’s fault-tolerance requirements, as any failures within any single gateway can potentially render the application unavailable. Prosimo Edge gateways are always deployed in multiple availability zones such that if one zone becomes unavailable, the other is available to handle the application traffic.

Prosimo built the Edge gateway using a cloud-native design: it runs as a cluster of microservices within Kubernetes and inherits the fault tolerance mechanisms of K8s. Within each Edge gateway, multiple worker nodes handle application traffic, and a controller node manages and constantly monitors them. If one worker node fails, traffic is re-routed to other worker nodes while another node is instantiated to replace the one that failed. When traffic spikes occur, additional worker nodes are automatically instantiated to accommodate the change in traffic pattern.

Each worker node contains several pods which provide security, optimization, visibility, and other services. Live monitoring of these pods occurs within the worker node to detect and quickly recover from any faults that may occur.
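The replace-and-scale behavior described above can be illustrated with a minimal sketch. It is not Prosimo's implementation: the function names, the per-worker capacity, and the minimum worker count are all hypothetical values chosen for illustration.

```python
import math


def desired_workers(current_rps: float, rps_per_worker: float, min_workers: int = 2) -> int:
    """Workers needed for the current load, never below a floor kept for redundancy.

    rps_per_worker and min_workers are hypothetical tuning values.
    """
    return max(min_workers, math.ceil(current_rps / rps_per_worker))


def nodes_to_launch(healthy_workers: int, desired: int) -> int:
    """How many replacement or additional worker nodes to instantiate."""
    return max(0, desired - healthy_workers)


# A traffic spike: 2,500 requests/s with workers that each handle about 1,000.
target = desired_workers(current_rps=2500, rps_per_worker=1000)
print(target)                                       # 3
print(nodes_to_launch(healthy_workers=2, desired=target))  # 1
```

The same calculation covers both cases in the text: a failed worker lowers the healthy count, and a traffic spike raises the desired count. Either way the gap is closed by launching new nodes.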

Figure 2: internal architecture of Prosimo's Edge gateway in Acme's cloud network
With this multi-layered, fault-tolerant design, Prosimo Edge gateways ensure the availability of Acme’s application and quickly recover from any disruption to the network.

Platform resiliency

The platform uses heartbeat messages, DNS, and software-defined networking concepts to maintain a resilient cloud network for organizations. Heartbeat messages are used throughout the fabric to continuously monitor the Edges and applications. A centralized controller exchanges heartbeat messages with the Edges, and the Edges exchange heartbeat messages among themselves to proactively detect and work around failures. Each Edge gateway deployed on the platform is assigned a unique hostname; the platform updates DNS with this record to attract network traffic from other Edges and application users. The FQDN of each Edge is dynamically updated in DNS to allow for a seamless redirection of traffic to alternate paths when a failover occurs.
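As a rough illustration of heartbeat-based detection, the sketch below counts consecutive missed heartbeats per peer and flags the peer once a threshold is reached. The class name, the threshold of three misses, and the peer name "pEdge1" are assumptions made for this example; the article does not state how many misses the platform tolerates.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatMonitor:
    """Tracks consecutive missed heartbeats per peer (an Edge or an application)."""

    # Hypothetical threshold: misses in a row before a peer is marked unresponsive.
    max_missed: int = 3
    missed: Dict[str, int] = field(default_factory=dict)

    def record(self, peer: str, replied: bool) -> None:
        """Record one probe result: a reply resets the counter, a miss increments it."""
        self.missed[peer] = 0 if replied else self.missed.get(peer, 0) + 1

    def is_unresponsive(self, peer: str) -> bool:
        return self.missed.get(peer, 0) >= self.max_missed


monitor = HeartbeatMonitor(max_missed=3)
for replied in (True, False, False, False):
    monitor.record("pEdge1", replied)  # "pEdge1" is a hypothetical peer name
print(monitor.is_unresponsive("pEdge1"))  # True
```

Requiring successive misses, rather than reacting to a single one, keeps a briefly flapping link from triggering a failover.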

As the application is critical to the business, Acme requires a network that can work around unplanned disruptions to ensure their employees in the Philippines always have access. Prosimo addresses this requirement in 2 ways:

  • Application failover
  • Edge network failover

Application failover

As mentioned earlier, Acme deployed the application in instances across the US-West and US-East regions, with the US-West instances serving as the active application and the US-East instances serving as the backup. Prosimo Edge gateways have been deployed in both regions where instances are located. The controller configures the primary Edge with details of the secondary Edge. The primary Edge uses this information to determine where to redirect active traffic when failover is required.

The steps below describe how failover occurs:

  1. Failover to the backup application begins after the active application is marked as unresponsive: the active app has failed to respond to successive heartbeat messages from the primary Edge gateway.
  2. The secondary Edge gateway will allow traffic flow to the backup application.
  3. The primary Edge gateway will forward live traffic to the secondary Edge at 10.20.20.1.
    1. It will determine the hostname of the secondary Edge (pEdge2.acme.prosimoedge.io) from its route table and retrieve the IP address from DNS. This lookup occurs before a failover event to ensure a speedy transition between primary and secondary Edges.

The application will now be reachable in the US-East region through the secondary Edge.
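The steps above can be sketched as a small forwarding decision on the primary Edge. This is an illustration under stated assumptions, not the platform's code: the class and method names are invented, the application address 10.10.10.50 is hypothetical, and DNS is replaced by a dictionary so the example runs offline. The secondary Edge's hostname and IP address are the ones used in the steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PrimaryEdge:
    app_ip: str                      # active application instance (hypothetical address)
    secondary_hostname: str          # taken from the Edge's route table
    resolve: Callable[[str], str]    # DNS lookup, injected so the sketch runs offline
    secondary_ip: Optional[str] = None

    def prefetch_secondary(self) -> None:
        """Resolve the secondary Edge before any failure occurs."""
        self.secondary_ip = self.resolve(self.secondary_hostname)

    def next_hop(self, app_unresponsive: bool) -> str:
        """Return where live traffic should be sent."""
        if not app_unresponsive:
            return self.app_ip
        if self.secondary_ip is None:
            # Fallback only; normally the address was prefetched.
            self.prefetch_secondary()
        return self.secondary_ip


# Stand-in for DNS, using the hostname and address from the steps above.
fake_dns = {"pEdge2.acme.prosimoedge.io": "10.20.20.1"}

edge = PrimaryEdge(
    app_ip="10.10.10.50",
    secondary_hostname="pEdge2.acme.prosimoedge.io",
    resolve=fake_dns.__getitem__,
)
edge.prefetch_secondary()
print(edge.next_hop(app_unresponsive=False))  # 10.10.10.50
print(edge.next_hop(app_unresponsive=True))   # 10.20.20.1
```

Resolving the secondary Edge ahead of time means the failover path needs no DNS round trip at the moment the application is marked unresponsive.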

Note: If Acme had users in the US-West region, they would have connected to the application through the primary Edge in that region. When the failover occurs, the primary Edge will forward traffic from users in that region to the backup application through the secondary Edge in the US-East region.

Edge network failover

As shown earlier, Acme deployed additional Prosimo ingress Edges in AP-Southeast-1 and AP-Southeast-3, extending the platform to the regions closest to the employees in the Philippines. Acme uses these Edge gateways to provide a consistent and optimized user experience at all layers of the networking stack (e.g., using the cloud backbone, TCP optimization, and caching). They also function as ingress points for securely connecting users to the application.

Edge network failover in Acme’s network is addressed in these scenarios:

  • Between primary and secondary Edge gateways.
  • Between employees and ingress Edge gateways.

The relationship between primary and secondary Edge gateways

All Edge gateways exchange heartbeat messages. When the primary Edge gateway detects an unavailable application instance, it signals failover execution to the secondary Edge through heartbeat messages. A similar signal is seen when the primary Edge gateway itself is unavailable, which may happen after a regional outage.

In Acme’s network, the platform controller programs the secondary Edge gateways with information about the application’s primary Edge gateway. Both secondary Edges fail over to the backup Edge in a similar fashion.

  1. They mark the active primary Edge at IP 10.10.10.1 unresponsive after successive missed heartbeat messages.
  2. They query DNS for the IP address of the backup Edge gateway.
    1. They look up the backup Edge gateway’s hostname in their application route table, i.e., pEdge2.app.acmeprosimo.io, and query DNS for this hostname.
  3. They forward traffic to the backup Edge at IP 10.20.20.1.

 

Because the backup Edge gateway also marks the active primary Edge as unresponsive, it will already have begun allowing traffic to flow to the backup instances of the application.
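A minimal sketch of this selection logic on a secondary Edge follows. It assumes the application route table lists Edge hostnames in order of preference, which the article does not specify; the function name and the hostname for the active primary Edge are hypothetical, and a dictionary stands in for DNS. The backup hostname and both IP addresses come from the steps above.

```python
from typing import Dict, List, Set


def select_upstream(route: List[str], dns: Dict[str, str], unresponsive: Set[str]) -> str:
    """Return the IP of the first Edge in the route entry not marked unresponsive."""
    for hostname in route:
        ip = dns[hostname]  # stands in for a DNS query on the Edge's hostname
        if ip not in unresponsive:
            return ip
    raise RuntimeError("no responsive Edge for this application")


# pEdge1's hostname is hypothetical; pEdge2's hostname and both IPs are from the steps.
dns = {
    "pEdge1.app.acmeprosimo.io": "10.10.10.1",
    "pEdge2.app.acmeprosimo.io": "10.20.20.1",
}
route = ["pEdge1.app.acmeprosimo.io", "pEdge2.app.acmeprosimo.io"]

print(select_upstream(route, dns, unresponsive=set()))           # 10.10.10.1
print(select_upstream(route, dns, unresponsive={"10.10.10.1"}))  # 10.20.20.1
```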

Between source endpoints and ingress Edges

As mentioned previously, Prosimo’s Edge gateways contain reverse-proxy capabilities which allow them to receive and forward traffic on behalf of users and applications. The platform uses simple DNS configurations to insert itself in the path for authentication/authorization of users and proxied access to applications. DNS is also used to quickly redirect user traffic to an alternate Edge gateway when a failover is required.

In Acme’s network, the applications in the AP-Southeast region, app2.acme.com and app3.acme.com, access the business-critical application through the secondary Edge gateways sEdge1 and sEdge2. Both applications perform similar functions for Acme, so if an unplanned disruption occurs within the AP-Southeast-1 region or with gateway sEdge1, app3.acme.com will continue to fetch data from app.acme.com through gateway sEdge2.

The employees in the Philippines access the application through the secondary Edge gateway (sEdge1) deployed in the AP-Southeast-1 region. The application hostname, app.acme.com, has a CNAME record assigned to Prosimo and is reachable through the IP address of sEdge1, 200.198.196.194.

When an unplanned disruption causes the Edge gateway in the AP-Southeast-1 region (sEdge1) to become unavailable, the steps below describe how the platform quickly detects the disruption and seamlessly redirects employee traffic to the Edge gateway in AP-Southeast-3, sEdge2.

  1. The secondary Edge gateway in the AP-Southeast-1 region, sEdge1, becomes unavailable following an unplanned disruption in the cloud region.
  2. The platform controller marks sEdge1 as unresponsive.
    1. Heartbeat messages are exchanged regularly between the Edge gateway and the controller to determine the gateway’s availability. As a result of the outage in the region, the controller receives no responses to the successive heartbeat messages sent to the Edge.
  3. The controller modifies the DNS table to ensure traffic redirection to sEdge2. The application CNAME entry is modified to the hostname of the Edge gateway deployed in the AP-Southeast-3 region, i.e., sEdge2.appacme.prosimo.com.
  4. Employee traffic for application app.acme.com is forwarded to sEdge2.

 

Note: In the unlikely event that no secondary Edge gateway is available in Acme’s network, the platform controller will redirect traffic to the primary Edge gateway of the application. The procedure is similar to the process described above. The only change is in step 3, where the CNAME entry is modified to the hostname of the primary Edge gateway deployed in the US-West region.
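The controller's DNS update, including the fallback described in the note, can be sketched as follows. This is an assumption-laden illustration rather than the platform's API: the function name is invented, the DNS table is an in-memory dictionary, and the hostnames for sEdge1 and the primary Edge are hypothetical. Only sEdge2.appacme.prosimo.com appears in the steps above.

```python
from typing import Callable, Dict, List


def repoint_application(
    dns_table: Dict[str, str],
    app_hostname: str,
    ingress_edges: List[str],
    primary_edge: str,
    is_responsive: Callable[[str], bool],
) -> str:
    """Point the application's CNAME at the first responsive ingress Edge,
    falling back to the application's primary Edge if none is available."""
    target = next((e for e in ingress_edges if is_responsive(e)), primary_edge)
    dns_table[app_hostname] = target
    return target


# Hypothetical hostnames, except sEdge2's, which is taken from step 3.
S_EDGE1 = "sEdge1.appacme.prosimo.com"
S_EDGE2 = "sEdge2.appacme.prosimo.com"
P_EDGE = "pEdge1.appacme.prosimo.com"

dns_table = {"app.acme.com": S_EDGE1}
down = {S_EDGE1}  # sEdge1 missed successive heartbeats

repoint_application(dns_table, "app.acme.com", [S_EDGE1, S_EDGE2], P_EDGE,
                    is_responsive=lambda e: e not in down)
print(dns_table["app.acme.com"])  # sEdge2.appacme.prosimo.com

down.add(S_EDGE2)  # no secondary Edge left: fall back to the primary Edge
repoint_application(dns_table, "app.acme.com", [S_EDGE1, S_EDGE2], P_EDGE,
                    is_responsive=lambda e: e not in down)
print(dns_table["app.acme.com"])  # pEdge1.appacme.prosimo.com
```

In a real deployment, how quickly clients follow the new CNAME also depends on the record's TTL and on resolver caching, which this sketch does not model.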

Figure 6: Traffic flows where all secondary Edge gateways are unavailable

Using the Prosimo platform, Acme can build a highly available and fault-tolerant cloud network architecture that meets their DR expectations and mitigates the impact of disruptions on business operations. Proactive monitoring at multiple layers and seamless traffic redirection using DNS keep the application accessible through the network while the user experience remains consistent.

Reach out to the team here at Prosimo to learn more about building a cloud network that meets your disaster recovery needs.