dacs.doc electric

 

The Server Won’t Boot
or Why You Can’t Have Too Many Backups
Part II

By Jim Scheef

 

In the first episode of this series I revealed--nay, I poured out my soul--about how I had been lulled into the false complacency of hardware redundancy, and how this had led me to that morning where my server would not boot and the most recent backup tape was over a month old.

So, what, exactly, did I see that fateful morning? Actually, it was worse than a BSD (Blue Screen of Death), as the server was looping through the failing boot process. The failure report shown on the BSD would flash by too quickly to read, and the boot process would start again. In fact, even BlueSave (1) from Winternals Software LLC (2) failed to save the blue screen information to disk. It all seemed rather hopeless.

The one bright spot was that the NT Workstation installed on another partition would boot. From this parallel installation, I was able to see the files and directories on the NT Server partition so I knew that the hardware was still working, the disk drives were fundamentally intact and that, in all likelihood, the server was not totally lost. The problem was where to start!

Microsoft to the Rescue

It was definitely time to get some outside help. For several years I have subscribed to MSDN, the Microsoft Developer Network (3). One of the benefits of the 'Universal' edition is two 'incident calls' to Microsoft Support. Since a support call can cost upwards of $200, these are a significant value when you need them. I definitely got full value, as you will see.

After a brief conversation with the dispatcher, I was assigned to a support engineer whom we will call Rick. Believe it or not, I was talking to Rick less than fifteen minutes after I placed the call. When I described the symptoms to Rick, we agreed that the first order of business was to make the server stop looping in the boot process so we could see the BSD and figure out what was causing the boot failure.

What we had to change was a check box on the Startup/Shutdown tab in the Control Panel/System applet. The check box corresponds to a value in the registry and tells the system to reboot immediately whenever the system halts. So how do we change a registry value of a system that won’t boot? The answer lies with the NT Workstation installation. Running NT Workstation gives access to the NTFS drives and to Regedt32. Regedt32, through the use of magic clicks and a secret decoder ring, allows one to load individual hives from another copy of NT. The sequence of menu clicks is too complex to describe here, but after several days, I got pretty good at it. Suffice it to say that once the registry hive is loaded, you can edit values as if it were the regular system registry. Of course it helps to have someone telling you which values to change.
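For readers who would rather script this than click through it, here is a minimal sketch of the same idea using the command-line reg tool found in later versions of Windows. It is not the procedure Rick and I followed. The mount name OfflineSystem, the drive letter in the hive path, and the choice of ControlSet001 are all assumptions made for illustration. The function only builds the commands; it does not run them.

```python
def autoreboot_commands(hive_path, control_set="ControlSet001", mount="OfflineSystem"):
    """Build reg.exe commands that clear AutoReboot in an offline SYSTEM hive.

    'mount' and 'control_set' are hypothetical defaults (assumptions);
    check which control set the broken installation actually boots from.
    """
    root = "\\".join(["HKLM", mount])
    key = "\\".join([root, control_set, "Control", "CrashControl"])
    return [
        # Mount the other installation's SYSTEM hive under a temporary key.
        ["reg", "load", root, hive_path],
        # 0 = halt on the blue screen instead of rebooting.
        ["reg", "add", key, "/v", "AutoReboot", "/t", "REG_DWORD", "/d", "0", "/f"],
        # Unload so the change is flushed back to the hive file.
        ["reg", "unload", root],
    ]


if __name__ == "__main__":
    # The drive letter here is an assumption; use wherever the server partition appears.
    for cmd in autoreboot_commands(r"D:\WinNT\System32\config\system"):
        print(" ".join(cmd))
```

Building the commands as lists rather than running them makes it easy to review exactly what will be changed before touching a hive that may be your only copy.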

One thing I found interesting was that the inactive registry (inactive because we were editing the registry of a system that was not running) had a slightly different structure from what you see when you open the registry on a running system. This was most easily seen in the HKEY_Local_Machine hive, where there were three ‘configurations’ rather than just the ‘Current Configuration’ that appears when editing an active registry. This is what allows you to load a “last known good” configuration on boot up.
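The ‘configurations’ described above are presumably the numbered control sets (ControlSet001, ControlSet002, and so on) in the SYSTEM hive. A Select key in that hive records which number plays which role, and the running system presents the current one as CurrentControlSet. The sketch below, with made-up example numbers, shows how those role values map to key names; the function name and the sample values are illustrative assumptions.

```python
def control_set_names(select_values):
    """Map Select-key roles to ControlSetNNN key names.

    A value of 0 means no control set is assigned to that role,
    so it maps to None.
    """
    return {
        role: (f"ControlSet{number:03d}" if number else None)
        for role, number in select_values.items()
    }


# Hypothetical values as they might appear under SYSTEM\Select.
example = {"Current": 1, "Default": 1, "Failed": 0, "LastKnownGood": 2}

if __name__ == "__main__":
    for role, name in control_set_names(example).items():
        print(role, "->", name)
```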

After changing the ‘Automatically Reboot’ value, I tried to boot the server and was finally able to see the BSD.

Recovery

A BSD shows the problem or error near the top of the screen in typically cryptic Microsoft language. Mine said something like “mspst32.dll is not a valid Windows NT image.” When something like this happens, the normal approach is to replace that one file and try to boot the server, hoping that only that one file is corrupt. I’ll hit the fast forward button on this story and tell you that there were many, many more corrupt files beyond the one named on that initial BSD. In addition, part of the registry, the software hive (4), was also corrupt. This was not a pretty picture.

Now that we knew the magnitude of the problem--just about every executable on the system needed to be replaced--we needed a plan. What we decided on was:

A. Make the server as stable as possible, by disabling as many services as possible.
B. Replace the system files from the \WinNT\System32 directory of the Workstation.
C. Replace the corrupt software hive from the last registry backup.
D. Boot the server and apply Service Pack 6a.

This should produce a minimal version of NT. Naturally it was not so simple and this process took more than a day of work, carefully coached along by Rick. Disabling the services meant editing the registry from Workstation to set the value of ‘start’ to a ‘4’ for disabled.
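As a quick reference, each service has a Start value under its own key in the Services section of a control set, and the standard meanings of that value are listed in the sketch below. The class and function names are made up for illustration.

```python
from enum import IntEnum


class ServiceStart(IntEnum):
    """Standard meanings of a service's 'Start' registry value."""
    BOOT = 0       # loaded by the boot loader
    SYSTEM = 1     # loaded during kernel initialization
    AUTOMATIC = 2  # started by the service controller at startup
    MANUAL = 3     # started only on demand
    DISABLED = 4   # never started


def describe(start_value):
    """Return a readable name for a raw Start value."""
    return ServiceStart(start_value).name.capitalize()


if __name__ == "__main__":
    print(describe(4))
```

Writing down the original value of each service before changing it to 4 makes it much easier to put everything back later.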

Unfortunately this minimal NT could not recognize the redundant disk drives. So the next step was to re-enable the disk mirror (RAID 1) on the system partition and the stripe set with parity (RAID 5) on the data drive. Naturally this had to be done without corrupting the NTFS drive structure. To do this we used a utility from the Windows NT Resource Kit called Dskprobe.exe, which is a disk sector editor. Rick prompted and I read the results off the screen. Gradually, after much probing and many reboots, we determined that we could not simply rejoin the two mirror halves. So, after making a tape backup of the barely running system, I deleted the partitions on the old mirror drive and rebuilt the mirror using the regular Disk Administrator program.

As the system became more stable we turned our attention to the stripe set. In this case we needed to rejoin the three drives that made up the array, and the drives had to be in the correct order or we would render them useless, assuming they could still be read at all! The process involved a careful analysis of a few bytes on the third sector (I think) of each disk. Once we knew the sequence, we looked in the registry to verify that the registry and the disks agreed. To my great relief, the stripe set with parity recovered without any data loss.
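For anyone wondering why a stripe set with parity can survive the loss of a drive, the general principle is simple XOR arithmetic: the parity block is the XOR of the data blocks in a stripe, so any one missing block can be rebuilt from the others. The sketch below illustrates only that principle with two data blocks and one parity block, as on a three-drive array. It says nothing about the actual on-disk layout NT uses, and the names and sample data are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across three drives: two data blocks plus their parity.
data_a = b"NT Server"
data_b = b"stripe 02"
parity = xor_blocks([data_a, data_b])

# Pretend the drive holding data_b is gone and rebuild it from the rest.
rebuilt_b = xor_blocks([data_a, parity])

if __name__ == "__main__":
    print(rebuilt_b == data_b)
```

This also hints at why the drive order matters: XOR the wrong blocks together and what comes out is garbage rather than data.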

It was now time for a full system backup. This would be just about the last backup using the old 2G native/4G compressed DAT tape drive. A full backup using this tape drive required three tapes and five tape changes to copy the files and then verify the backup. This process took about 8 hours of actual backup time, not counting the time everything sat waiting for me to change the tape! Is there any doubt about why I thought this was a pain in the ass? So, I ordered a new DDS-4 Sony tape drive that holds 20G native and 40G compressed. Glory be! Once again I can back up the entire machine to a single tape!

Next time

By now Rick and I had been working on the machine several hours a day for a week. Generally when Rick and I ended our call, my work was only beginning, as Rick would be sure I had a list of tasks to perform before our session the next day. Sometimes we might lose a day while I completed a tape backup. The status at this point was a system that booted NT Server but did little else. It still would not talk to the network and all the server software, like SQL Server and Exchange Server, was still disabled.

So hang in there until next time in September, when we will recover DHCP, DNS, and WINS, as well as solve a vexing Catch-22 trying to restore a single DLL. Plus, we might learn something about NT networking along the way.

Remember, summer is what you do between ski seasons, so have a good time!

Notes

1. BlueSave is a product that writes the blizzard of numbers on a BSD to a text file that can be printed or emailed.
2. www.winternals.com
3. MSDN Subscriptions is a membership service which delivers essential programming information and the latest Microsoft® software and tools each month on your choice of CD-ROM or DVD-ROM. See http://msdn.microsoft.com/subscriptions/prodinfo/overview.asp
4. I now believe that the reason these files were corrupt was that the software registry hive was corrupt. We knew the software hive was corrupt because we could not load it in Regedt32.

Jim Scheef is the Mad Scientist at Telemark Systems Inc. where he develops custom software using Visual Basic and SQL Server and provides networking services using Windows NT/2000. He has been a DACS member since the day DOG became WC/MUG.
