dacs.doc electric

 

The Server Won’t Boot
or Why You Can’t Have Too Many Backups

By Jim Scheef

 

A few months back I wrote an article on building a Windows NT server. In that article I stressed hardware redundancy and some installation tricks that would help to repair the server in the event of a problem.

Well, over the last three weeks I’ve had the dubious opportunity to test those tricks when my server failed to boot one morning. Instead of the normal logon prompt, I got the dreaded Blue Screen of Death, or simply a "BSD" to those who work with Windows NT.

Whenever you see a BSD on a Windows NT/2000 machine, be it server or workstation, you know you’ve got trouble--big trouble. Normally a server must be recovered post haste, while the people dependent on the server to do their work wait and repeatedly ask "How much longer?" Fortunately or unfortunately, this was my server with the BSD, so no one else would be waiting--but there was also no one else to fix it. So what’s the "normal procedure" when this happens? That, of course, depends on what’s wrong. My first problem was that the BSD wouldn’t sit still long enough to read. The machine would get to the point of the error and instead of displaying the Blue Screen, it would reboot. So I not only could not use the machine, but I couldn’t see what was wrong!

Now the normal plan would be to restore the server from the last backup, and that’s what I would have done, except the last backup was four months old. How, you ask, did I let myself get into this situation? Don’t I know better? OK, ok, yes, I do know better but I let myself fall into THE TRAP.

The Failure

When you hear someone say they lost all of their data, what usually happened? Their hard drive crashed, of course. Looking at the major components of a personal computer, what is most likely to fail? The components with moving parts, and the disk drives (hard, floppy and CD) and the fans are just about all there is. All PCs have a fan in the power supply and when this fails, the power supply will fail soon after. Then what? Your computer won’t operate, but in all likelihood, your data is still safely stored on the hard drive. When the power supply is replaced, your computer resumes operation with all your programs and data intact.

Of course the story is different when the hard drive fails. My old laptop ran hot and when it was used daily, the hard drive would fail every 8 to 12 months. I was careful to back up all source code every day so very little was lost there, but email and other things were another matter. Following a hard drive failure I would get religion and back up the laptop frequently. But as the months passed, I would get complacent, thinking this hard drive would last longer, and when the time came, the last backup was always at least a month old.

The failure of those small parts inside a hard drive has been everyone’s fear since IBM developed the first hard drive in the Fifties. Today high performance hard drives spin at 10,000 rpm. It’s no wonder that hard drive failure is the primary cause of lost data.

The Trap

So what can we do to minimize the pain of hard drive failure? Well, how about an exact copy of every byte of data that is updated constantly every second the computer is on? This concept is called "mirroring". It is one form of RAID, the fancy name for which is Redundant Array of Inexpensive Disks. RAID comes in several flavors and mirroring one disk on another is the simplest variety. With everything mirrored, aren’t you safe? Well, that’s what I thought, and that is the trap!
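For readers who think in code, here is a toy sketch of what mirroring does. It is only an illustration in Python and not how Windows NT implements a mirror set; the class name MirroredVolume and its methods are made up for this example, and each "disk" is just a dictionary of block numbers.

```python
class MirroredVolume:
    """Toy model of a two-disk mirror. Each 'disk' maps block number -> bytes.

    Hypothetical illustration only; not any real RAID driver's API.
    """

    def __init__(self, disk_count=2):
        self.disks = [dict() for _ in range(disk_count)]
        self.failed = set()  # indexes of disks that have died

    def write(self, block, data):
        # Every write goes to every healthy disk, so the copies stay identical.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                disk[block] = data

    def fail_disk(self, index):
        # Simulate a hardware failure: that disk's contents are gone.
        self.failed.add(index)
        self.disks[index].clear()

    def read(self, block):
        # Read from the first healthy disk that still holds the block.
        for i, disk in enumerate(self.disks):
            if i not in self.failed and block in disk:
                return disk[block]
        raise IOError("no healthy copy of block %d" % block)


volume = MirroredVolume()
volume.write(0, b"article draft")
volume.fail_disk(0)                # one drive dies
survivor_copy = volume.read(0)     # the mirror still has the data

volume.write(0, b"garbage")        # a bad write is copied just as faithfully
after_bad_write = volume.read(0)
```

In this sketch the mirror survives a dead drive, but whatever the software writes, good or bad, lands on every copy. A mirror is a live duplicate, not a history, which is why it does not replace a backup.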

Hard drives can be configured as mirrors or stripe sets (another type of RAID, see the sidebar) using either hardware or software. Hardware RAID requires a more expensive hard drive controller. Software RAID is configured by the operating system (Windows NT Server) but works with any disk controller. I am using software RAID and as the Microsoft Support Engineer who has worked with me to recover my server said, "Software RAID is only as good as your operating system." What he meant was that when the disks are configured in software and you lose your operating system--as I had done--you are very likely to lose the ability to recover the data on those disks. So, did I lose everything on my server? Was my whole life--all the code I had ever written, every article, all my email and other records--lost forever? Tune in again next month when we will follow the steps suggested by "the Microsoft Guy"--we’ll call him Rick--and learn if there is life after a Windows NT Server crash.


Jim Scheef is the Mad Scientist at Telemark Systems Inc. where he develops custom software using Visual Basic and SQL Server and provides networking services using Windows NT/2000. He has been a DACS member since the day DOG became WC/MUG.
