Major outages on the big cloud platforms such as AWS, Microsoft Azure and Google Cloud Platform are pretty rare, right? There have been a few high-profile regional outages, such as the Azure South Central US region being taken down by thunderstorms and a major AWS S3 service disruption in the Northern Virginia (us-east-1) region, but global outages don’t happen all that often, and when they do, they’re recovered fairly quickly. Even small blips tend to shake consumer confidence though, and these incidents are often big news, especially when they take down high-profile sites or services such as Uber, Reddit, or parts of Office 365.
But what would happen if one of the major cloud providers were taken out for a significant period of time? How would that affect your business? Are you able to revert to pen and paper or would you suffer significant loss of business? And how would that happen? Are we talking end of the world, zombie apocalypse type scenario?
A cyber incident that takes a top three cloud provider offline in the US for three to six days would result in loss estimates between $6.9 and $14.7 billion, according to the “Cloud Down” report produced by Lloyd’s of London.
Trevor Maynard, head of innovation at Lloyd’s, said: “Clouds can fail or be brought down in many ways, ranging from malicious attacks by terrorists to lightning strikes, flooding or simply a mundane error by an employee. Whatever the cause, it is important for businesses to quantify the risks they are exposed to as failure to do so will not only lead to financial losses but also potentially loss of customers and reputation.”
And that is how the entirely silly “ZombieTrak” mobile app came into being. As there are many more business critical systems making their way into public cloud, we wanted to test the water with a multi-cloud provider serverless app that would accommodate the loss of an entire cloud platform and continue operating.
The app itself is fairly simple: a Xamarin Forms application that uses Facebook as a login provider. We could have used other social logins or Active Directory federation, but we wanted a simple, open example that would let as many people as possible just log in. The app tests the availability of each provider directly (rather than through the Cloudflare load balancer) and displays the status on the front page.
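To make the direct availability check concrete, here is a minimal sketch in Python rather than the app's own Xamarin code. The health URLs and the function names are hypothetical and not taken from ZombieTrak; the idea is simply to probe each provider's endpoint without going through the load balancer and record whether it answered successfully.

```python
from typing import Callable, Dict
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical direct (non-load-balanced) health URLs; not taken from the app.
PROVIDER_HEALTH_URLS = {
    "aws": "https://aws-api.example.com/health",
    "azure": "https://azure-api.example.com/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_providers(
    urls: Dict[str, str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each provider name to True if its endpoint answered with 2xx."""
    results = {}
    for name, url in urls.items():
        try:
            results[name] = 200 <= fetch_status(url) < 300
        except (URLError, OSError):
            # DNS failure, timeout, refused connection or an HTTP error status.
            results[name] = False
    return results
```

Passing the fetch function in as a parameter keeps the check easy to exercise without a network, which is handy when the thing being tested is an outage.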
The rest of the API calls go via the Cloudflare load balancer, making use of Cloudflare’s highly available DNS infrastructure and global Anycast network. Cloudflare monitors the health of the two API endpoints and directs traffic accordingly. This was the best I could do to abstract the load balancing away from the cloud providers (short of trying some BGP and Anycast magic myself, which would have been complex and expensive!). It’s also possible to do some of this via Amazon Route 53 or Azure Traffic Manager, but in the interests of keeping it provider-agnostic and truly multi-cloud, rather than leaning on one platform more than the other, I went with Cloudflare.
Facebook login is triggered within the mobile app and returns a token. This token is then used with both Amazon Cognito and EasyAuth, the authentication feature of the Azure App Service that hosts the Azure Function.
Since we’re hitting the Cloudflare load balancer, we don’t actually know which endpoint a request will reach, nor, for resiliency purposes, should we care. We do need to ensure that both auth services can accept the request. For Azure, we POST the Facebook token to the /.auth/login/facebook endpoint and add the token from the response to the X-ZUMO-AUTH header of subsequent API requests.
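The Azure exchange can be sketched as follows, in Python rather than the app's Xamarin code. The hostname is hypothetical, and the JSON field names (`access_token` in the request, `authenticationToken` in the response) follow the App Service client-directed sign-in flow as I understand it, so check them against the current documentation before relying on them.

```python
import json
from urllib.request import Request

# Hypothetical Function App hostname; substitute your own.
AZURE_APP_URL = "https://zombietrak-example.azurewebsites.net"


def build_login_request(facebook_token: str, base_url: str = AZURE_APP_URL) -> Request:
    """Build the POST that exchanges a Facebook token for an App Service token."""
    body = json.dumps({"access_token": facebook_token}).encode("utf-8")
    return Request(
        f"{base_url}/.auth/login/facebook",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def zumo_headers(login_response_body: bytes) -> dict:
    """Pull the session token out of the login response and wrap it in the header."""
    payload = json.loads(login_response_body)
    return {"X-ZUMO-AUTH": payload["authenticationToken"]}
```

Sending the request with `urllib.request.urlopen` and passing the response body to `zumo_headers` gives the header to attach to every later call.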
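AWS is a little more complex. We need to use the AWS API to sign the requests using Signature Version 4 (SigV4), with the temporary credentials Cognito issues in exchange for the Facebook token, but this is fairly easy.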
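In practice an AWS SDK does the signing for you. To show what actually ends up in the headers, here is a hand-rolled sketch of SigV4 using only the Python standard library. It is not the app's code, and it makes simplifying assumptions: no query string, a path that is already URI-encoded, and only the host, date and session token headers signed. The function name and parameters are illustrative.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Dict, Optional


def _hmac(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(
    method: str,
    host: str,
    path: str,
    body: bytes,
    access_key: str,
    secret_key: str,
    region: str,
    session_token: Optional[str] = None,
    service: str = "execute-api",  # service name used by API Gateway
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Return the headers that sign a request which has no query string."""
    now = now or datetime.now(timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    headers = {"host": host, "x-amz-date": amz_date}
    if session_token:
        # Temporary credentials (such as Cognito's) carry a session token.
        headers["x-amz-security-token"] = session_token

    # Step 1: canonical request, with headers sorted by lowercase name.
    signed_headers = ";".join(sorted(headers))
    canonical_headers = "".join(f"{k}:{headers[k]}\n" for k in sorted(headers))
    payload_hash = hashlib.sha256(body).hexdigest()
    canonical_request = "\n".join(
        [method, path, "", canonical_headers, signed_headers, payload_hash]
    )

    # Step 2: string to sign, scoped to date, region and service.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    request_hash = hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()
    string_to_sign = "\n".join(["AWS4-HMAC-SHA256", amz_date, scope, request_hash])

    # Step 3: derive the signing key and compute the signature.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

    result = {
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }
    if session_token:
        result["X-Amz-Security-Token"] = session_token
    return result
```

The returned dictionary can be merged with the X-ZUMO-AUTH header from the Azure login, since headers that are not listed in SignedHeaders do not affect the signature.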
What we end up with is an HTTP request that contains headers required for both AWS and Azure.
On the AWS side, the endpoints are served by API Gateway, which uses Cognito Identity Pools to authorise each request and then passes it through to Lambda functions. Depending on the request method, the app is either served data from DynamoDB, or an item is placed on an SQS queue. That item is then handled by a persistence Lambda, which both inserts the data into DynamoDB and places an item on the Azure Storage Queue so that the two clouds are kept in sync.
On the Azure side, requests go straight to the HTTP(S) endpoints provided by Azure Functions. As with AWS, these functions will, depending on the request method, either return data from Cosmos DB or place an item on a storage queue. The storage queue message is then picked up by a persistence function, which both inserts the data into Cosmos DB and places an item on the AWS SQS queue to keep the two in sync.
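The two persistence functions mirror each other, so the pattern can be sketched once. The version below uses in-memory lists in place of DynamoDB, Cosmos DB and the two queues so that it runs anywhere. It also assumes something the description above does not cover: how a replicated message avoids bouncing back to the cloud it came from. The `origin` field used for that here is purely hypothetical, not how ZombieTrak necessarily does it.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def make_persistence_handler(
    cloud: str,
    save: Callable[[Message], None],
    forward_to_other_cloud: Callable[[Message], None],
) -> Callable[[Message], None]:
    """Build a queue-triggered handler that stores locally and replicates once."""

    def handle(message: Message) -> None:
        # "origin" (hypothetical) records which cloud first accepted the write.
        stamped = {**message, "origin": message.get("origin", cloud)}
        save(stamped)
        # Only forward writes that started here, so replicas do not ping-pong.
        if stamped["origin"] == cloud:
            forward_to_other_cloud(stamped)

    return handle


# In-memory stand-ins for the two data stores and the two queues.
aws_table: List[Message] = []
azure_container: List[Message] = []
sqs_queue: List[Message] = []
storage_queue: List[Message] = []

aws_handler = make_persistence_handler("aws", aws_table.append, storage_queue.append)
azure_handler = make_persistence_handler("azure", azure_container.append, sqs_queue.append)

# A write accepted on the AWS side is stored and queued for Azure...
aws_handler({"id": "42", "sighting": "zombie near the car park"})
# ...where it is stored without being sent back to AWS.
azure_handler(storage_queue.pop(0))
```

This only covers inserts flowing in one direction at a time. It does nothing about conflicting updates made on both sides, which is the harder consistency problem.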
In conclusion, this was a really fun one to work on, and it proved the concept to ourselves. Using as much serverless as possible was a key concern: resiliency costs money, and serverless let us keep costs as low as possible compared with IaaS (Infrastructure-as-a-Service) options such as EC2 instances and Virtual Machines backed by SQL Azure or RDS databases. We could also take the multi-cloud concept further by adding other providers such as Google Cloud Platform, DigitalOcean or IBM Cloud.
Interestingly, as Cloudflare sends alerts when one of the nodes becomes unresponsive, I have noticed that both AWS and Azure have been knocked out of the load balancer group at some stage. This may have been an issue with the API layer or with the serverless functions themselves, and only lasted for a few minutes.
This was a pretty easy scenario to deal with though, and we’re sneakily skirting around the issue of syncing data updates, which would require much more complex ways of keeping data consistent across cloud platforms.
We’ve explored the complexities of multi-cloud before. Implementing this strategy in an environment with a high level of governance may carry a significant management overhead, but if the stakes are high, a multi-cloud solution could offer the level of resiliency you need to ensure business risks are covered.