Rescuing a non-booting server
Rescuing pin.setti.info
The server’s hosting provider has a really advanced rescue system for dedicated servers. It is possible to activate the rescue system through the web control panel and then remotely reboot the server. When the server reboots, it boots the rescue system from the network instead of its own hard drive. This requires a network interface card that supports the “Preboot Execution Environment” (PXE). The result is complete root access on the server without loading any files from the hard drive itself. The rescue system is, however, still running on the server’s own processor and in its own memory. Cool.
A new kernel install almost always fails a couple of times before it starts working. At those times it is nice to have a good rescue system. The new kernel installed on the server yesterday was no exception in terms of failing. Unfortunately, for some reason the rescue system wasn’t activated automatically but required some manual rebooting - the reasons for which will remain a mystery.
With the rescue system at hand, it should be an easy task to revert the change in the Linux boot loader (LILO) configuration and run the boot loader command again to write the old boot sector back to the disk. This is how it worked on the previous server, where the kernel was updated many times without anyone even noticing. Unfortunately, the easy part ended at the “reverting the change” part. Writing the changes back to the disk was a whole other matter.
In Linux there is direct access to all devices through the directory /dev. Anybody can try this out by playing weird tunes from their hard drive with the command:
cat /dev/sda > /dev/audio
The command sends the contents of the hard drive to the sound card, which does its best to interpret the weird stuff that’s directed at it.
Access to the hard drives becomes critical when you want to write something as exotic as a boot sector to the hard drive. The file /dev/sda must be there for LILO to write to.
The way to make use of the rescue system is to make it compatible with the actual server. Then the rescue system can write the wanted boot sector and kernel to the server’s hard drives, and the server can be booted normally. The problem is getting that compatibility to work.
The age-old way is to boot a rescue system, mount the server’s root partition at /some/directory, change to that directory and run “chroot” to make the server’s root partition the root partition in the rescue system too. Then all commands typed in the rescue system are actually the commands installed on the server. That way it is possible to run the server’s software without ever actually booting from its disk.
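The mount-and-chroot routine described above can be sketched roughly as follows. The device name /dev/sda2 and the mount point /mnt/server are assumptions made for the example, not details of this particular server, and the run and enter_chroot helpers are made-up names. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: enter the installed system from the rescue system via chroot.
# ROOT_DEV and TARGET are assumptions; set them to the real root partition
# and to any empty directory in the rescue system.
ROOT_DEV="${ROOT_DEV:-/dev/sda2}"
TARGET="${TARGET:-/mnt/server}"

# run CMD...: print the command unless DRY_RUN=0, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# enter_chroot: mount the server's root partition and start a shell inside it.
enter_chroot() {
    run mkdir -p "$TARGET"
    run mount "$ROOT_DEV" "$TARGET"
    run chroot "$TARGET" /bin/sh
}
```

Calling enter_chroot after loading these definitions previews the three commands; DRY_RUN=0 enter_chroot, run as root, would execute them for real.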
Now the problem, for some reason, is that the files under /dev/ are difficult to create. They are special files called “block special” and “character special” files. In most Linux distros there are boot scripts under /etc/init.d/ or /etc/rc.d/ which create the files. Usually the command is “makedev” or something similar. In Debian, the command did not work here, because the script needs procfs, sysfs and probably a process called “udevd”, which has real difficulty getting started in the “chrooted” environment. The makedev script only succeeded in creating about ten useless block files under /dev.
The solution is to create the required special files for hard drives manually.
The special files have two weird properties: a “major” and a “minor” device number. Together the numbers define which device the file is connected to. It seems that the major number corresponds directly to the numbers listed in /proc/devices (8 is the “sd” disk driver). The minor is something else - for the first hard drive, sda, it’s the number of the partition, with 0 meaning the whole disk.
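As a rough illustration of how the numbers fit together for disks, here is a small sketch. It assumes the classic Linux numbering for “sd” disks: major 8, with 16 minor numbers reserved per disk, so sdb starts at minor 16. The function name sd_numbers is made up for this example, and it only handles sda through sdp.

```shell
# sd_numbers NAME: print "major minor" for a disk name such as sda or sdb3.
# Assumes major 8 and 16 minors per disk (sda..sdp, partitions 1..15).
sd_numbers() {
    name=$1
    case $name in
        sd[a-p]|sd[a-p][1-9]|sd[a-p]1[0-5]) ;;
        *) echo "unsupported device name: $name" >&2; return 1 ;;
    esac
    letter=$(printf '%s' "$name" | cut -c3)   # disk letter, e.g. "b"
    part=$(printf '%s' "$name" | cut -c4-)    # partition number, empty for the whole disk
    # Position of the disk letter counted from zero (a=0, b=1, ...)
    index=$(( $(printf '%d' "'$letter") - 97 ))
    echo "8 $(( index * 16 + ${part:-0} ))"
}
```

On a system where /dev is populated, such as the rescue system outside the chroot, ls -l /dev/sda shows the same pair of numbers in the column where a regular file would show its size.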
So now it’s just a matter of creating the device files that give access to the hard drives.
mknod /dev/sda b 8 0
mknod /dev/sda2 b 8 2
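The two commands above could be wrapped in a small helper that skips nodes which already exist. This is only a sketch: the make_node name and the MKNOD override are made up for illustration, and creating real device nodes needs root.

```shell
# make_node PATH MAJOR MINOR: create a block device node unless one exists.
# MKNOD is a hypothetical override for previewing, e.g. MKNOD="echo mknod".
make_node() {
    if [ -b "$1" ]; then
        echo "$1 already exists"
    else
        # Left unquoted on purpose so that "echo mknod" splits into two words
        ${MKNOD:-mknod} "$1" b "$2" "$3"
    fi
}
```

With that in place, make_node /dev/sda 8 0 and make_node /dev/sda2 8 2 would do the same job as the plain mknod lines.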
Then execute “lilo” and voilà: the working boot sector and kernel are written to the disks! Reboot & pray.
Works.
PS.
Next time it might be a good idea to try copying the working kernel files from the server’s hard drive into the rescue system’s file space (it’d be wrong to say the rescue system’s hard drive, because the rescue system does not have a hard drive but uses the server’s RAM) and then try to execute LILO completely from the rescue system’s filesystem.
