Cloud Computing

What lessons should be learned from Amazon's recent cloud outage?

As everyone is well aware, Amazon recently had a significant cloud outage that affected customers far and wide.  What lessons do you think should be learned from this recent outage?

7 Replies

  • Whether on premises or off premises, always have a continuity and DR plan. I had a recent experience with the Mozy cloud backup provider, where I learned that they delete your files if you don't perform a backup within 30 days. They assume you don't need your files anymore because you did not connect to the cloud. I lost some important files that I had not kept locally (including personal photos of my children) because I thought they were safe in the cloud. So the cloud helps us, but don't put all your eggs in the cloud basket.

  • I agree with Jordan. I just want to add one more thought. All new developments bring new things to consider. In this case we have a new kind of complexity that the large-scale cloud providers must handle in order to serve a huge number of requests and customers. That we have new challenges should be crystal clear: how many people had ever heard of a re-mirroring storm prior to the Amazon outage?

  • I agree that disaster and continuity planning are key to surviving an outage such as the one with AWS. But it also demonstrates the need for more resiliency offerings from the Cloud service providers. Right now, it's the responsibility of the consumer to handle failover and recovery for their applications. The service provider needs to offer more inherent options for automated failover given their control of the infrastructure from the hypervisor down, which consumers don't have access to.

  • First and foremost we need to understand that storage - in general - isn't often a highly available "service". We don't have storage "RAID" services that ensure redundancy and thus resiliency in storage.

    In general, we should recognize that we have a need to understand how services work when they are provided by a cloud computing provider. Black box mentality is great marketing (hey, no worries!) but in reality it’s dangerous to the health and well-being of applications deployed in such environments because you have very little visibility into what’s really going on. The failure to understand how Amazon’s Availability Zones actually worked – and exactly what constituted “isolation” aside from separate power sources as well as what constitutes a “failure” – lies strictly on the customer. Someone within the organization needs to understand how such systems work from the bottom to the top to ensure that such measures meet requirements.
    Finally, we need to implement a proper disaster recovery / high availability architecture and test it. Proper disaster recovery / high availability architectures are driven by operational goals, which are in turn driven by business requirements. A requirement for 100% uptime will likely never be satisfied by a single site, regardless of provider claims.
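
    To put rough numbers on the single-site point, here is a minimal sketch in Python. It assumes that sites fail independently of one another and that each site offers 99.9% availability; both are illustrative assumptions, not figures from any provider, and the function names are made up for this example. Under those assumptions, redundancy shrinks expected downtime quickly but never reaches exactly 100%.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of `sites` redundant sites, assuming independent failures.

    The service is down only when every site is down at the same time.
    """
    return 1.0 - (1.0 - site_availability) ** sites


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # 0.999 is a hypothetical per-site availability chosen for illustration.
    for n in (1, 2, 3):
        a = combined_availability(0.999, n)
        print(f"{n} site(s): availability {a:.9f}, "
              f"about {downtime_hours_per_year(a):.4f} hours down per year")
```

    If failures are correlated across sites, the real figures will be worse than this estimate, which is one more reason to test the failover path rather than trust the arithmetic.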

  • Plan for redundancy. The cloud doesn't remove the need for a recovery/availability cookbook. It's a lesson anyone who has ever worked on a mission-critical application has learned the hard way.

  • Lots of good stuff on the need for a sensible business continuity plan (reminds me of the TV production company that lost years of episodes due to vandalism at a cloud storage provider), but I'm going to have to take a different tack.

    I see the Amazon outage as being the best advertisement for cloud services we've seen to date. How many IT departments do you know which would provide a written report which:

    1. explains the problem in clear language,
    2. accepts full responsibility,
    3. details the actions which will be taken to prevent the problem from occurring in the future, and
    4. compensates the departments and organisations affected.

    And all this when Amazon was already offering a higher reliability solution which was not affected (i.e. if you ran redundant zones).

    Contrast this with many internal IT departments. There was an outage at Westpac (large AU bank, and one of the top ten in the world) last week where faulty air conditioning took out the payments network for the best part of the day. All we got was a sound bite "bad air conditioner, we're sorry". You can just imagine the conversations inside the bank.

    CEO to CIO: "Please explain"
    CIO to CEO: "It was bad, sorry, won't happen again"
    CIO to Ops: "Don't let it happen again"
    Ops: "Oh well"

  • The cloud allows you the comfort to step back so you can allot more time to doing other things but it doesn't absolve you from understanding how it all works. With this important realization, you can now plan for a lot of things just as you would be doing if you were not on the cloud. I'm sure it can be a lot of things and 'planning for failure' is one of them.

    Come to think of it, the outage may have been bad, but only in the short term. Without it, things would eventually have been much worse, because by then many would have been afflicted with something more dangerous: complacency.
