Outages Happen

On August 10, the main data center for Exchange Defender suffered a catastrophic power failure. Clients using the spam filtering service had their incoming and outgoing mail interrupted or massively slowed for most of a day, and hosted mailboxes went offline for as long as two days.

On August 17, Microsoft Office 365 had an outage after an unspecified failure in a major North American server center. Mail was down for 2-3 hours for some subscribers.

On August 7, power equipment failures took a portion of Amazon’s EC2 cloud computing platform in Europe offline for 24-48 hours.

On March 1, 35,000 Gmail users were offline for as long as five days, while Google scrambled to restore data lost in a server snafu.

In April, mail went down for some of Apple’s MobileMe customers.

On August 3, many Yahoo Mail users were offline for more than a day.

Are you worried about the reliability of cloud services and hosted mail? You’re missing the point.

In a client’s office earlier this year, the mail went down for two days when the onsite Exchange server threw a fit, and I worked frantically with Microsoft support engineers to bring the database back online.

In a client’s office a few months ago, the mail went down for two days when the server failed unexpectedly and had to be replaced with new hardware and restored from backups.

In a client’s office this spring, the mail went down for almost a week when the company’s Internet connection went down after some miscommunications with AT&T.

In a client’s office yesterday, the mail went down overnight after I made a typing mistake in the company’s MX record during a mail migration.

Outages happen.

Very small businesses probably don’t have the budget for redundant Internet connections or failover servers. A very small business can do only a few things to prepare for an interruption in business-critical services: choose cloud platforms carefully, buy good equipment, and keep good backups.

The most important thing, though, is simply to recognize that outages happen. No technology can be delivered 100% of the time, 24×7, 365 days a year. Try to get past anger and try to get past blame, because outages are going to happen when there’s no one to blame and no one who deserves the anger.

The big companies are building a very resilient infrastructure for cloud services, and the work to harden it continues. Don’t draw too many conclusions when you see a report about the outage of the moment.