June 20, 2010 · by weppos

We’re currently in the middle of a very long downtime. So far, the server has been unavailable for 930 minutes, which is an unbelievable amount of time, I know. Let me explain what happened.

RoboDomain is currently a free service, but we’re trying to do our best to be as reliable as possible. Unfortunately, this time we failed.

Yesterday we received an email from VPS.net, the current RoboDomain server provider, telling us they were going to move our VPS to a new location sometime in the next 11 days. To prevent unexpected downtime, they said, they also strongly encouraged all VPS owners to start the migration on their own. Quoting from their email:

Each move should take a matter of minutes (but you could be queued waiting to do the move), and will require no configuration changes on your end at all.

Unfortunately, this turned out not to be true for the RoboDomain VPS. The VPS was shut down at 8:30 pm CET and is still offline. Apparently, the server has been migrated, but we can’t access it via SSH and the web console is returning a weird boot error.

The migration has had issues since the beginning. First, we had to wait about one hour in the migration queue but, fortunately, the server was still online at that time. Then the server was shut down, but the migration didn’t start for another 3-4 hours. When the migration finished, the reboot failed, and the server is still offline after more than 12 hours.

You can read part of the conversation on Twitter.

There are no words to explain how sorry I am for this downtime. I’m absolutely aware that a good service is measured not only in terms of features, but also in terms of reliability.

We have had serious issues with VPS.net more than once in the past months, especially long downtimes without any kind of service monitoring or failover, exactly the kind of issue we’re experiencing right now (and we’re not alone). This made us realize they aren’t the kind of partner we want to trust for a hosting service. We can’t trust a provider that, more than once, performed maintenance tasks without being able to watch the status of the affected servers and act on failed tasks.

Also, they never did anything other than apologize. But when you experience more than 12 hours of downtime twice in the last 3 months, plus additional shorter downtimes, apologies are not enough.

We have been planning to migrate RoboDomain to a new provider for a few months, but we always postponed the task, giving higher priority to feature development. Today’s issue was the last straw, and we’re now planning to move RoboDomain away from VPS.net in the coming weeks. I’ll keep you updated on the status.

Thank you for your patience and, again, I’m really sorry for what happened.

UPDATE 01:21 pm (CET): the kernel issue has been fixed and the server is now up and running.