The Story So Far - Why We Were Down For 45 Hours

Posted by rs on Wed 08 Jul, 2009 11:13:03 +0000

“Outage” - a word that comes with so much burden and disgust, especially nowadays with the advent of cloud computing: most users expect full 24x7 uptime, regardless of the service. The reality, however, is that even services like Google App Engine can go down. Most of these outages come down to very common events (and boy, we’ve heard our share of them!) like disk failures, security breaches, network outages and even data center fires. Hey - even lightning can strike the cloud, right?

When the service disappeared off the internet on 6th July 2009 at 15:20 BST, I immediately assumed it was one of the usual reasons. However, when I realised that all our servers (we have a few of them) had disappeared, I began to panic. For a moment, I thought that something really bad had happened – I mean, to the extent that it was the end of the internet as we knew it.

After trying to diagnose the situation for 30 minutes or so, I called up the service providers, and they basically told me that they couldn’t tell me what had gone wrong. All they could say was that their infrastructure was working fine, but that they had had to disconnect my servers. Apparently, the only person with the authority to tell me what was going on had gone home for the day, and I had to wait till the morning. I found that really odd, and began to panic even further! Was it a security breach? Was one of my processes doing something really sick and affecting others in the data center? Or maybe goblins had just come out and started eating away at the data center. I even re-read their Terms of Service and Policy Notes to double-check that I had not done anything “out of the ordinary”.

At around 9pm BST, I got a call from the “local authorities” (I can’t say who they are right now, but rest assured that they are valid local UK authorities with jurisdiction in the UK) saying that they wanted to visit me at home to discuss the situation. That just blew my mind – what in the world had happened to make these guys visit me at home?

It turns out that a user had uploaded data to the service that he/she was not authorised to have. This is your basic intellectual property theft case that we’re talking about here. The local authorities had to take all the server hard drives away for examination, and I was told that someone would be in contact with me the following day (i.e. 7th July 2009).

The following day, I was on the phone trying to get them to speed things up, but to no avail. Apparently everyone was trying their very best. Later in the day, I did get a call saying that the hard disks would only be returned to the data center the following day (i.e. today).

This morning at around 9am BST, the local authorities visited me at home. We got everything sorted out and the service was brought online at around 12noon BST.

The main issue here is that this case of IP theft is an ongoing investigation, so I really can’t tell you guys anything at all. In fact, this whole blog post is the only information I can let out even at this point.

Hell, I hate myself for doing that to you. It totally goes against every single grain of ethical business practices that I’ve grown to adhere to and love.

A 45 hour outage is inexcusable. But this is one of those WTF moments that I just have to take in and suffer through with my beloved users. It is really uncommon for any service on the internet to go through this sort of “experience”. Having said that, every service on the internet is exposed to this risk: certain users will upload or share information that they do not own.

There will be some changes to the service in the coming weeks to avoid the lengthy delays the authorities took to return the hard drives. In fact, at one point yesterday, I was contemplating disabling the creation of new repositories for Free users, but then, two minutes later, I immediately retracted the idea, thinking “Why should thousands upon thousands of users be affected by one user’s silly actions?”.

The one thing that I will definitely do is bring the servers closer to home (the UK). It will require purchasing some hardware and paying co-location costs, but I think it will be a worthwhile investment – for you and for me. In fact, from the quotations that I’m looking at, the new servers should be faster (which is always a good plus point).

I do apologise for the prolonged outage, but I hope you understand that a lot of it was out of my control – I just couldn’t pull off a Chuck Norris and get those hard drives back, now, could I? :)

Everything should be back to how it was when the service was taken offline on 6th July 2009. If you have any questions, please put them in the comments below, or just raise a support ticket.



rs on Wed 08 Jul, 2009

At some point yesterday I was really, really wishing that it was the data center exploding – would have been much easier to explain.

rs on Wed 08 Jul, 2009

@eduramiba – I missed one DNS entry in putting things back. Do give it another try. The DNS entry might take an hour to propagate though.
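If you want to check whether the corrected entry has reached your resolver yet, a small sketch like this (Python; the hostname below is a placeholder – substitute the service’s real domain) shows what your machine currently sees:

```python
import socket

def visible_a_records(hostname):
    """Return the IPv4 addresses the local resolver currently returns for hostname."""
    try:
        return sorted({info[4][0]
                       for info in socket.getaddrinfo(hostname, None, socket.AF_INET)})
    except socket.gaierror:
        return []  # name not (yet) resolvable from this machine

# Placeholder hostname for illustration; use the service's domain instead.
print(visible_a_records("localhost"))
```

An empty list means the record has not propagated to your resolver yet; re-run it after a while, since cached answers can take up to their TTL (around an hour here) to refresh.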

vern on Fri 10 Jul, 2009

rs on Fri 10 Jul, 2009

@vern: Uh huh – that pretty much sums it all up.

Here’s a clickable link to the Bloomberg article

