The Curse Of Downtime: What Happens When Your Server Fails?

What happens when your server fails?

You don’t have a spare server in the closet.

That’s really the important part. You’d have more options if you had two identical servers and kept one in the closet for emergencies. But you didn’t buy two identical servers. No one does. You’ve got one server that does everything for your business.

Let’s set the stage. I’m going to generalize about small businesses with one or two servers and 3-20 employees. You’ve come in on Monday morning and the server is down. The power won’t come on, or there’s an error on the boot screen, or the hard drive has stopped spinning. It’s dead. What happens next?

For the sake of this discussion, assume that you’ve got a good backup. On Sunday, your backup system delivered a perfect backup to the external hard drive attached to the server, or your data was safely backed up online. That has been our goal until now: you have a good backup and no data will be lost. Great! All we need to do is restore from the backup and get back up and running.

The first few hours on Monday morning are spent trying to figure out what happened. By definition, I can’t do anything remotely, so you wait anxiously for me to arrive. I’ll spend an hour or two trying to bring the server back to life. Depending on your setup, I might be looking at the administration utility for the RAID controller that manages the hard drives, booting from special repair tools on USB sticks, or yanking the hard drives to test them in other systems. Time goes by.

While I’m working, your office is down. Files on the server are unavailable. Your line-of-business program won’t start. The registers in the tasting room won’t work. Your printers are disconnected. In many offices, the server handles DNS for the workstations, so no one can get online. If you use Office 365 for your mail, you can still get mail – on your phones.

If you’re lucky, the server comes back to life. The problem turns out to be solvable – a failed update, a UPS that needs to be replaced but can be bypassed, something else that’s scary but solvable. It happens. Servers don’t fail very often.

Today, though, we’re imagining the death sentence. I call you into a meeting with a glum expression and deliver bad news. The server is dead in a difficult way.

It’s already noon. Your employees have been idle all morning.

These are your options.

  •  We can begin the process of obtaining warranty support from Dell or HP. The server support teams are pretty good, and you can expect next-day support for servers under warranty. If it goes well, then after a couple of hours on the phone with someone in India, Dell might send somebody out the next day with some replacement parts. Of course, if it doesn’t go well, that person won’t arrive, or will arrive with the wrong parts, or the server will have unexpectedly difficult problems that the tech isn’t prepared for, or Dell will ship parts that don’t arrive for a couple of days. And in any case, Dell doesn’t take any responsibility for getting the server up and running. Their warranty obligation ends when the server is capable of operating. Restoring from your backup is your problem, and that work can only start after the hardware is fixed.

  •  We can try to speed things up by getting replacement parts without the potential delays of dealing with tech support. Warranties are fine, but hard drives are cheap, and maybe we’ll get lucky with a replacement hard drive. There are hard drives and there are hard drives, however, and we’re not going to get a server-quality hard drive from Best Buy, so there’s likely to be a delay of a day or two in the best case – and the job of restoring from the backup still has to be done (more about that below).

  •  If the server is truly deceased, then we start to run out of good options. You don’t have a spare server in the closet that is identical to your production server. I don’t have one either. Nobody in town sells servers. If we have to order a new server – well, Dell will ship one in a week or ten days, and we will still have a difficult job ahead of us to bring the business back to life.

Notice that none of those scenarios gets you back to work the same day. At best, you’re up and running towards the end of the following day. At worst – ah, the worst is bad.

At this point you’re looking at me angrily and saying, “I thought we had a backup!”

You have a backup. A backup is only useful if you have a working server. You don’t have a spare server in the closet, remember?

Most very small businesses are doing backups with Windows Server Backup, the program included with all versions of Windows Server. It’s a good backup program. In another article, I’ll tell you some of the reasons I want to switch to a different program, but Windows Server Backup is okay.

  •  If you have a backup done by Windows Server Backup on an external hard drive, and you have identical hardware to the dead server, you can restore an exact clone of the server relatively quickly – a few hours. That’s the “server in the closet” scenario. It requires literally identical hardware, meaning you bought two servers on the same day and set one aside. You can’t buy a server today that’s identical to your four-year-old server; Dell doesn’t make that model anymore.

  •  If you have a backup done by Windows Server Backup on an external hard drive, and you have a new server that is not identical to the dead server, maybe – maybe – we’ll be able to restore the backup in its entirety in a few hours to empty hard drives. Windows Server Backup is supposed to be able to do a bare metal restore to “similar” hardware. In practice, it’s pretty quirky about that. My luck has not been good.

  •  The alternative is to take that new server and build it up from scratch – install Windows Server, install the updates, install SQL Server, install your line-of-business programs, set up the domain, set up the user accounts, manually re-create the settings from the old server, then run the restore routine in Windows Server Backup and restore the files and databases. It’s arduous, time-consuming work. It will probably take more than one day. And it can only start after there’s a new server onsite, and where did that come from again?

Maybe you also have an online backup. Online backups typically include the data files in the Company and User folders, and perhaps the backup files from your database or line-of-business programs. The files can be restored to a working server, so it’s an alternative in case there’s a problem with the backup on the external hard drive, but it doesn’t solve the hard problem if there’s no working server to start the process.

There are many variations on what happens next. In the next article, I’ll describe how it went for a small law firm last month, where we came up with some inventive ways to get everybody back to work (more or less) in a couple of days. I’m a creative guy. There are many ways to stitch together a system that works well enough that employees can get online and perhaps have access to their files. It might take longer to restore the case management program, the tasting room cash registers, the time and billing software, or the document management program.

Creativity will only get us so far. When your server fails, there is a high likelihood that your business will be down for anywhere from 1-2 days to a week. When you’re back up and running, remember, no data has been lost! Your backups worked as designed.

If all the technology in your business has been dead for several days, you’re not going to think it was a very good design.

In the next article, I’ll tell you an anecdote that focuses on the dismal reality of server failure. Then: good news! I have some ideas about how to improve your backups and reduce your downtime to a minimum. It will take a little money and some additional management, but that’s what it means to evaluate risk and take steps to control it in a changing world. You’re more dependent on technology than ever before. It’s time to adjust your setup to meet your evolving needs.