Amazon recently posted their response to the outage that hammered their EC2 platform. It would seem that the outage itself was triggered by a piece of network maintenance that was not carried out properly, which in turn set off a rather catastrophic chain of events within Amazon’s custom systems. Ultimately, it resulted in data loss, as well as downtime, for many customers – such as Heroku, who posted their own post-mortem of the incident here.
Microsoft Azure also suffered problems recently, with parts of their system becoming unavailable. The first outage, in March, was blamed on an OS upgrade that went awry; then in April there was an Azure Storage outage, whose cause I’ve not actually seen explained in any real detail (if anyone has a link, please point me to it – I’d love to know what happened). However, I think the stark contrast between these two vendors is in the transparency and information given – both at the time and after the fact.
Amazon have gone the whole hog: totally admitting the fault, identifying exactly (in full Technicolor) the issues that occurred, and resolving – publicly – to fix them. They have also issued a decent refund of compute time. Microsoft? Well, I’ve not heard of any refunds – even partial ones – for the outages that occurred on their platform. Nor have I heard of any refunds related to outages on another of their cloud platforms, Business Productivity Online Suite, which has had its own problems of late.
So is using cloud technology too risky? In a nutshell, no, as long as you are sensible. I can’t say that I would advocate putting everything in the cloud unless it’s totally stateless and can keep operating if any SQL instances etc. disappear. If you need to store state, or anything really sensitive, I still prefer the hybrid model, but I guess that’s because the providers need to do more to convince me that they are as secure as they proclaim to be.
The biggest fault with people who have used clouds to date and suffered outages is quite simply education. They have put applications up into the cloud and expected them to be highly available. That’s not the case. Unfortunately you still need to understand the requirements of highly available design, and be sure to implement them – including setting your application up in different zones / regions, and ideally in different geographical locations! If you don’t, all you are really doing is running a small cluster.
I know that many people will be screaming about the EC2 outage in particular, given that it was caused by human error. But I’d love to see them do better in their own data centre. Human error occurs everywhere, but where do you think the resources (i.e. skills AND money) to mitigate it are better? On premises with yourself, or out in a cloud?