The Curse Of Downtime: A Story From Real Life


Servers fail without warning sometimes.

This is what happened in a law office last month when the server failed. Each failure has its own story and the details of your story will be different if you’re ever in this position, but there’s a broad point to be made: unexpected downtime has become hard to tolerate.

The setting: two busy lawyers and a small staff, working from a Windows 2008 server that held 50 GB of Word documents and PDFs in a shared FirmDocs folder, plus a database-driven program – PCLaw – that they rely on daily for timekeeping, billing and accounting.

Everything was by-the-book for a small business or law firm, and everything worked as designed. That’s both the good news and the bad news.

The server was a five-year-old Dell PowerEdge T300. It had a Dell hardware RAID controller with two mirrored hard drives in a RAID1 array. If one drive fails, the server keeps operating normally and starts to beep to call attention to the failure.

The server was monitored by Bruceb Remote Management. There were no warnings of any kind that the server might fail.

Onsite backups were being done nightly by Windows Server Backup to an external hard drive.

Online backups were being done nightly by Bruceb Cloud Backup.

On a Wednesday morning in April, I got a call when the first people in the office were not able to open any files in the FirmDocs folder. The Bruceb Remote Management dashboard had actually raised a flag in the middle of the night when the server failed to check in. When a monitor, keyboard, and mouse were hauled across the office and attached to the server, we could see that Windows was not running and the server was stuck on a black screen with an ambiguous error message. It appeared that one hard drive had failed. That's not good, but it's not as if smoke was pouring out of the case. The server is designed for this. Drives can be replaced without a lot of disruption.

Almost three days went by while we tried to revive the server. When one drive fails, the server keeps running – that’s the whole point of the mirrored drives! We tried to start the server without the drive that was reported as failed; we tried to replace the failed drive with a new one and rebuild the RAID array; eventually we got two new drives, removed both of the existing drives, and built a new RAID array.

Time passed. Everything takes time. Example: there were no spare hard drives lying around. At one point we literally borrowed a hard drive from the office down the hall.

With a new RAID array, we could rebuild the system from scratch, which starts by installing the same version of Windows Server from the install disk. We couldn’t find the disk. More time went by while I tracked down an .ISO that could be transferred across the Internet to the office, then burned to a disk. Burning the disk failed a couple of times until we figured out another error message (it turned out that the workstation was running a 32-bit version of Windows and creating a bootable disk for a 64-bit OS would only work from a 64-bit machine).

More time passed.

We booted from the DVD and ran the Windows Server install routine, which proceeded normally for fifteen minutes until a progress bar appeared onscreen that wasn't moving. Was it working? There was no way to tell except to walk away and wait in case it came back to life.

Tick, tick, tick.

We waited an hour. The installation did not start up again. The server appeared to have frozen in the middle of the install. What?

We repeated the process and got the same result – but again, it took time to be sure.

We put in a Windows 7 install disk and – well, I’ll be damned. It installed perfectly normally. The server hardware and hard drives appeared to be just fine. But we put in the Windows Server disk and got the same result – for unknown reasons, the server had decided it was never going to run Windows Server 2008 again.

We were past midday on Friday, three days into this exercise, and only now had it become clear that this server was not going to recover. Each step was reasonable, and each was done in the reasonable hope that it would lead to a complete recovery. Three days!

The attorneys and staff were getting their Office 365 email and they were able to get online, but they had been cut off from their documents when the server failed. Everything important was in the shared FirmDocs folder on the server. Deadlines for court filings still had to be met. They could work around being cut off from the accounting program but the documents had to be accessible.

Early on, we looked at the backups.

The Windows Server Backup on the external hard drive could only be restored from a computer running Windows Server. We didn’t have a computer running Windows Server. That backup wasn’t going to be helpful right away.

We turned to the online backup. I created a cloud server on Windows Azure, installed the Bruceb Cloud Backup software, and restored the files to the online server. Then I created a user on the online server, shared the folder with that user, mapped a drive letter on each person’s workstation using WebDAV, and supplied the online user credentials to get permission to read and write the files.
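For reference, the per-workstation mapping step can be done from a command prompt. This is only a sketch – the server address, drive letter, user name, and password below are placeholders, not the real ones:

```shell
REM Map a drive letter to the FirmDocs share on the cloud server over WebDAV.
REM Requires the WebClient service, which handles WebDAV on Windows.
REM Host name, account, and password are hypothetical examples.
net use F: https://firmdocs-restore.cloudapp.net/FirmDocs P@ssw0rd /user:firmdocsuser /persistent:yes
```

Once the drive is mapped, the FirmDocs folder shows up in Explorer like any other network drive, which is why the users could get back to work without retraining.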

The result was that the users had complete access to everything in the FirmDocs folder relatively quickly after the server failed – on the second day of the outage, anyway.

(Geeky detail: this was an incredibly cool thing to do, and nothing like it would have been possible a year ago. There’s a detailed walkthrough here. Running that online server for three weeks cost $21. Yeah, there were some quirks and glitches, but really – is that great or what?)

On day three, while we were coming to the realization that the server might not come back, we decided to move the FirmDocs folder to a spare Windows 7 computer in the office – opening and saving large files from the cloud server was taking extra seconds and driving people crazy. I restored the Bruceb Cloud Backup online files and mapped a drive to the local computer. We were able to install PCLaw on that spare computer and restore the PCLaw data, then map the user workstations so that program also came back to life. There were, of course, glitches in that process, too.

You haven’t forgotten that the clock is ticking in the background, right? Every single step takes time to accomplish – time that the attorneys and staff are spending on frantic efforts to work around the obstacles that have been thrown up.

Although everyone was working and connected to their files, that wasn’t the end of the work that needed to be done. The backup programs had to be reconfigured for the files in their new locations. We discovered a bug in the Bruceb Cloud Backup program: it had restored all the backed-up files, including files that had been deleted or moved from the server, instead of restoring a snapshot of the server as it had been on the day before the crash. Tech support at Cloudberry Labs: “Oopsie! We know about that bug but it won’t be fixed until the end of May.” They’ve been responsive and the program is trustworthy so I’m trying not to judge them.

As a result, we had to look at the Windows Server backup on the external hard drive. That was successful, eventually. I learned some new things about its strengths and weaknesses during that exercise. (It probably tells you something that I wound up with a strong belief that I have to find a different backup program for my clients.) In the end it took two weeks to restore the definitive document folders from the day before the server crash.
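Restores from a Windows Server Backup drive are driven by the wbadmin command-line tool (or the Windows Server Backup console). A hedged sketch of recovering a single folder to an alternate location – the drive letters, paths, and version identifier here are placeholders:

```shell
REM List the backup versions stored on the external drive (assumed to be E:).
wbadmin get versions -backupTarget:E:

REM Restore the FirmDocs folder from one version to an alternate location.
REM The version identifier and paths below are examples only.
wbadmin start recovery -version:04/15/2014-02:00 -itemType:File -items:D:\FirmDocs -backupTarget:E: -recoveryTarget:C:\Restore -recursive
```

Note the catch described above: wbadmin has to run on a machine with the Windows Server Backup feature, which is why this path was blocked until a server-class machine was available.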

Since the server was never going to exist again, I had to disjoin all the workstations from the domain with as little disruption to the user profiles as possible, before the workstations started to refuse to log in because the domain server was gone. More time, more expense, more disruption. Did I mention the workstation that unexpectedly had the user Documents folder redirected to the server – something I didn’t discover until after disjoining the domain, making all those documents unreachable? Ha ha! Good times. They came back after I ran System Restore, thank goodness, leaving only the interesting question of why it took more than eight hours for System Restore to be completed. I still can’t figure that one out.
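A disjoin like that can be done per-workstation from an elevated command prompt. This is a sketch only, assuming the netdom tool is available; the machine and domain names are hypothetical:

```shell
REM Force the workstation out of the dead domain and into a workgroup,
REM without waiting for the unreachable domain controller.
REM FIRM.local is a placeholder domain name.
netdom remove %COMPUTERNAME% /domain:FIRM.local /force

REM Reboot to complete the change; users then log in with local accounts.
shutdown /r /t 0
```

The disjoin itself is quick; the time sink is preserving each user's profile and re-pointing anything (like a redirected Documents folder) that silently depended on the domain.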

It’s not over. There are still decisions to be made about whether to set up a new server or perhaps move to the cloud and store documents in a service like NetDocuments.

Servers fail. There is nothing unusual about the sequence of events that restored order here; it is the planned result of the backup schemes small businesses have used for more than ten years. At one time you could accept some downtime and chaos – you were less dependent on technology to do your job, and the alternatives were unreasonably expensive.

Times have changed. There are inexpensive ways to minimize downtime and recover more elegantly in a crisis. The cost of downtime has skyrocketed – if your technology isn’t working, your business is likely being harmed right away.

I have some suggestions. We’ll take a break for an article or two, then I’ll return with some ideas for new backup software and a new plan that will help you rest easier.