Note: this is a translation of an old post. I decided to translate it because now I'm looking for a SysAdmin position (tell your friends!) and I would like this post to show how I work.

Last Saturday I received an email from one of the guys from work with the subject «urgennnnnnnnt: heeeeeeeeeelp»[sic]. He says he was idling on Friday night when his machine stopped emitting sound via the soundcard and then started behaving erratically. When he tried rebooting it, it didn't boot anymore. «It says something about disk not bootable...».

Monday morning I go to work and go to see the machine. Indeed, it said something about «disk not bootable». I boot from a USB key with GRML and I find that the disk has no partitions.

Panic.

The guy is doing a PostDoc in something astronomical (literally) and all his work is on that machine. No backups, as usual, so I prepare myself to rescue the partitions.

In that same USB key I have a system with parted. I boot with it and I try using parted's rescue tool. Nothing. I ask the guy how the disk was partitioned, etc. He tells me that he only installed Kubuntu clicking 'Next'. Kubuntu by default creates a swap partition and an ext3 partition for / and that's it, which made what was coming relatively easy.

I reboot in GRML and I use hexdump -C /dev/sda | more to see the disk's content. This is not the first time that I juggle with partitions and MBRs, but last time I did it, I used a tool that is now discontinued (the tool was called DiskEdit, included in The Norton Utilities), which had special edit modes for MBRs, FATs, and a lot of useful things... in the MS universe.

First I confirm that, yes, the first sector is an MBR (at least it has the 0x55aa signature at the end), and that the whole partition table is empty, but in the second sector of the disk there seems to be a copy. I take pen and paper and write down what I found, but it turns out that not only do I have just half the data, the partition I thought I had found is also too small.
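For illustration, the same check can be sketched in Python. This is not the tool used in the story (that was hexdump and a pair of eyes); the function names parse_mbr and table_is_empty are made up for the example, and it only assumes the classic MBR layout: four 16-byte entries starting at offset 446 and the 55 aa signature in the last two bytes.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446        # the partition table starts here in a classic MBR
ENTRY_SIZE = 16           # four entries of 16 bytes each
SIGNATURE = b"\x55\xaa"   # last two bytes of the sector


def parse_mbr(sector):
    """Return (has_signature, entries) for one 512-byte MBR sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly one 512-byte sector")
    has_signature = sector[510:512] == SIGNATURE
    entries = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # byte 0: boot flag, bytes 1-3: CHS start (skipped), byte 4: type,
        # bytes 5-7: CHS end (skipped), then start LBA and sector count,
        # both 32-bit little endian.
        boot_flag, ptype, start_lba, sectors = struct.unpack("<B3xB3xII", raw)
        entries.append({
            "bootable": boot_flag == 0x80,
            "type": ptype,
            "start_lba": start_lba,
            "sectors": sectors,
        })
    return has_signature, entries


def table_is_empty(entries):
    """True when no entry describes a partition."""
    return all(e["type"] == 0 and e["sectors"] == 0 for e in entries)
```

On a real disk the sector would come from something like open("/dev/sda", "rb").read(512), run as root and strictly read-only.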

So I decide to look for the partition by hand. To do that I first needed to find out how the ext3 kernel code knows whether a partition is ext3 or not. I knew it would be some kind of magic signature, but I had no idea which. So I installed the sources for 2.6.29 on my laptop and started to look at ext3's code. After going around a lot, I ended up reading the code that is executed when you mount a filesystem of type ext3, where we can see that it uses a magic signature[3], and the structure of the ext3 superblock, where we can see the magic's offset is 0x38.

So the problem of finding an ext3 partition is reduced to the problem of finding the bytes 53 ef (the magic is 0xEF53, damn little endian) at offset 0x38 of a sector on the disk. Luckily more has a find tool, so I sit down to search for every occurrence of 53 ef, hoping that the address at the left ends in 30 and that they are the 9th and 10th bytes in the line (damn 0 based offsets).
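The manual search can be approximated with a short Python sketch. It is only an illustration, not what was run on the day: find_superblock_candidates is a hypothetical name, and it assumes 512-byte sectors and a stream that returns full chunks (a regular file or a block device). Every hit is just a candidate; random data and backup copies of the superblock can match too.

```python
import io

SECTOR_SIZE = 512
MAGIC_OFFSET = 0x38          # offset of the magic inside the superblock
MAGIC_BYTES = b"\x53\xef"    # 0xEF53 as it is stored on disk (little endian)


def find_superblock_candidates(stream, chunk_sectors=2048):
    """Yield the byte address of every sector with 53 ef at offset 0x38."""
    base = 0
    while True:
        chunk = stream.read(SECTOR_SIZE * chunk_sectors)
        if not chunk:
            break
        # Only look at sector boundaries, and only where both magic bytes fit.
        for off in range(0, len(chunk) - MAGIC_OFFSET - 1, SECTOR_SIZE):
            pos = off + MAGIC_OFFSET
            if chunk[pos:pos + 2] == MAGIC_BYTES:
                yield base + off
        base += len(chunk)


# Tiny fake "disk": 16 empty sectors with the magic planted in sector 5.
image = bytearray(SECTOR_SIZE * 16)
planted = 5 * SECTOR_SIZE + MAGIC_OFFSET
image[planted:planted + 2] = MAGIC_BYTES
candidates = list(find_superblock_candidates(io.BytesIO(bytes(image)), chunk_sectors=4))
print([hex(address) for address in candidates])
```

Against the real disk the stream would be open("/dev/sda", "rb"), and the addresses it yields are the same ones hexdump shows in its left column.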

A few 'next's later, I get my first candidate. It looks good, because I was also comparing my findings with a similar dump from my USB key (which I have formatted as ext2, and luckily ext2 and ext3 share those structures), and I also spot something that looks like a uuid.

This candidate's address is 0x80731038. I subtract 0x38 and I get the address 0x80731000, a nice round number for a superblock. Converted to decimal that's 2.155.024.384, some 2GiB from the disk's beginning. Looks really good! The swap partition could be before the root one, and could have that size.

I use fdisk /dev/sda to see the (still empty) partition table. It says there are 16.065 sectors per cylinder, times 512 bytes per sector, equals 8.225.280 bytes per cylinder. Almost all distros (actually I think all of them) partition disks at cylinder boundaries[1], so if the sector I found is right at the beginning of a cylinder...

I divide 2.155.024.384/8.225.280=...

(suspense pause)[2]

262.000124494...

Damn! I almost had it... Hmm, how much is the fractional part? (262.000124494-262)*8.225.280=... 1024! Is it that...?
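The arithmetic above is easier to follow with integer division, which avoids the rounding of the fractional part. A small Python sketch using only the numbers from the story (the variable names are mine):

```python
SECTOR_SIZE = 512
SECTORS_PER_CYLINDER = 16065   # as reported by fdisk: 255 heads * 63 sectors
BYTES_PER_CYLINDER = SECTORS_PER_CYLINDER * SECTOR_SIZE

magic_address = 0x80731038                  # where 53 ef was found
superblock_address = magic_address - 0x38   # start of the superblock

# Whole cylinders before the superblock, and the bytes left over.
cylinders, remainder = divmod(superblock_address, BYTES_PER_CYLINDER)

print(BYTES_PER_CYLINDER)     # 8225280
print(superblock_address)     # 2155024384
print(cylinders, remainder)   # 262 1024
```

So the superblock sits exactly 1024 bytes past a cylinder boundary.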

I run strace debugfs -R show_super_stats /dev/sdb1 (the partition in my USB key) and I see that it actually seeks 1024 bytes within the partition for reading the superblock!
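A rough Python equivalent of that check, for illustration only: it seeks 1024 bytes into a would-be partition, verifies the magic at offset 0x38 and pulls out the uuid. The function name read_superblock_info is made up, and the uuid offset (0x68) is taken from the ext2/ext3 superblock layout as I know it, not from the text above. It is no substitute for debugfs or fsck.

```python
import io
import struct
import uuid

SUPERBLOCK_OFFSET = 1024   # the superblock sits 1024 bytes into the partition
MAGIC_OFFSET = 0x38        # s_magic, 16 bits, little endian
UUID_OFFSET = 0x68         # s_uuid, 16 raw bytes (assumption, see above)
EXT_MAGIC = 0xEF53


def read_superblock_info(stream, partition_start=0):
    """Return magic and uuid if an ext2/ext3 superblock is found, else None."""
    stream.seek(partition_start + SUPERBLOCK_OFFSET)
    superblock = stream.read(1024)
    if len(superblock) < UUID_OFFSET + 16:
        return None
    (magic,) = struct.unpack_from("<H", superblock, MAGIC_OFFSET)
    if magic != EXT_MAGIC:
        return None
    raw_uuid = superblock[UUID_OFFSET:UUID_OFFSET + 16]
    return {"magic": magic, "uuid": uuid.UUID(bytes=raw_uuid)}


# Fake image with a "partition" starting at byte 4096.
fake_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")
image = bytearray(8192)
superblock_at = 4096 + SUPERBLOCK_OFFSET
image[superblock_at + MAGIC_OFFSET:superblock_at + MAGIC_OFFSET + 2] = struct.pack("<H", EXT_MAGIC)
image[superblock_at + UUID_OFFSET:superblock_at + UUID_OFFSET + 16] = fake_uuid.bytes
info = read_superblock_info(io.BytesIO(bytes(image)), partition_start=4096)
print(info)
```

For the disk in the story, partition_start would be 262 * 8225280, that is, the candidate address minus 0x38 minus 1024.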

This is it. With 262 in my head, I fire up fdisk /dev/sda and I create two partitions: swap in cylinders 1-261 and root from cylinder 262 till the end. I save, cross my fingers and I run debugfs -R show_super_stats /dev/sda1. It fails! What's wrong? I reboot and I try again, just in case the kernel did not re-read the partition table correctly. It fails again. WTF?

Ah, duh, it's sda2, where is my head... Ok, debugfs -R show_super_stats /dev/sda2... it works, the sonofabitch works! I can't believe it. I risk it: fsck -n /dev/sda2. «Filesystem is clean». Damn, I try harder: fsck -n -f /dev/sda2...

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda2 etc etc...

It's fine! But the MBR doesn't have GRUB installed, so I do the usual GRUB reinstall process, I reboot...

It boots like nothing has happened, and it finishes in a beautiful login. Satisfied, I pat myself on the back, pack my things and I start my weekend.

sysadmin rescue


[1] ... wasting some 8MiB between the MBR and the first partition.

[2] The sharp ones reading this will notice that this cannot give an integer by any means.

[3] Reiser magics are funny. Looks like he started the fad that now AdOlEsCeNtS use.

Posted Thu 07 Apr 2011 11:46:15 PM CEST Tags:

Today I decided to upgrade my home server (the one that serves this blog) from lenny to squeeze. Here is a 'log' of the experience.

My first mistake was on the name: it is not squeezy, it's squeeze.

Second, the server once was also a minimal desktop, so I uninstalled a lot of desktop software to make the upgrade smaller and easier. I simply used my favorite package manipulation tool, dselect[1], and selected for purging all the optional and extra packages in sections libs, python and perl. When the consequences were shown to me, I just marked as install the software I wanted. After that, ~450 packages were removed.

Following the release notes, and after checking the known upgrade issues, I did the first suggested step: a minimal upgrade.

mdione@cobra:~$ sudo apt-get upgrade
[...]
233 upgraded, 0 newly installed, 0 to remove and 142 not upgraded.
Need to get 67.2MB of archives.
After this operation, 11.5MB of additional disk space will be used.
[...]

I have apt-listbugs installed, so I got this question just before accepting the upgrade:

serious bugs of libslang2 (2.1.3-3 -> 2.2.2-4) <marked as done in some version>
 #615909 - Copyright file does not clearly state licence terms (Fixed: slang2/2.2.3-1)
serious bugs of deborphan (1.7.27 -> 1.7.28.3) <marked as done in some version>
 #618895 - orphaner enteres infinite loop on sparc (Fixed: deborphan/1.7.28.4)
critical bugs of initscripts (2.86.ds1-61 -> 2.88dsf-13.1) <unfixed>
 #612594 - On boot thw wait have no job to wait for, and fail into reboot.
serious bugs of libpcre3 (7.8-2 -> 8.02-1.1) <unfixed>
 #616660 - /usr/bin/pcretest must not be shipped in libpcre3
Summary:
 libpcre3(1 bug), libslang2(1 bug), deborphan(1 bug), initscripts(1 bug)
Are you sure you want to install/upgrade the above packages? [Y/n/?/...]

The critical bug for initscripts looked ugly, so I checked it in more depth. It seemed to affect usplash, which I don't use (no use in a headless server, right?), so I bit the bullet and continued. The bug's discussion said that the solution was to purge usplash anyways... The rest of the bugs were, pragmatically speaking, not interesting to me.

Then apt-listchanges showed me the unread entries in the NEWS.Debian.gz files for the upgraded packages, with no news that applied to my server. Interesting was the split of pam_cracklib into itself and pam_pwhistory, so now we can test the reuse of passwords without checking for dictionary attacks, in the strange case one would want to do so, or the other way around. That means that if you want both, you have to enable both.

Besides some conffile resolution (it would be nice to be able to resolve diffs with meld or xxdiff), the upgrade went smoothly.

Upgrading the kernel was painless too. The kernel NEWS included a note on the renaming of PATA devices due to the new SCSI/PATA drivers, but I was aware of that because I read about the upgrade issues first :) I just had to tell linux-base to please update the disk devices in my config files, which is the default anyways.

udev was a little bit more bumpy:

critical bugs of udev (0.125-7+lenny3 -> 164-3) <unfixed>
 #593083 - udev - system hangs at login screen
serious bugs of util-linux (2.13.1.1-1 -> 2.17.2-9) <unfixed>
 #613592 - /sbin/fdisk: Can't create at sector 63
 #613589 - /sbin/cfdisk: Bad Table error after fresh Squeeze install

The first one seemed to be something quite manageable, and the other two were not interesting to me. The bullet just needed a little bit more of squeezing, that's all (Hahahahaaa! I told a joke! I can do this crap too![2]).

For the last step I used dselect again, as I love the way it presents dependency resolution. I took the opportunity to purge all the obsolete packages, and I got no dependency problem with that, which means the upgrade should be complete and smooth. This last step meant:

128 upgraded, 68 newly installed, 32 to remove and 0 not upgraded.
Need to get 96.5MB of archives.
After this operation, 33.1MB disk space will be freed.

Yeah, smooth indeed, except for these:

serious bugs of wget (1.11.4-2+lenny2 -> 1.12-2.1) <marked as done in some version>
 #614373 - wget: mixes dpatch and 3.0 (quilt) (Fixed: wget/1.12-3)
serious bugs of lvm2 (2.02.39-8 -> 2.02.66-5) <marked as done in some version>
 #603710 - root and swap devices on lvm do not correctly show up in udev (missing symlinks) (Fixed: lvm2/2.02.84-1)
   Merged with: 593625
serious bugs of libgssapi-krb5-2 ( -> 1.8.3+dfsg-4) <marked as done in some version>
 #611906 - GSSAPI in krb5 1.8 fails to delegate credentials to W2K8R2 (Fixed: krb5/1.8.3+dfsg-5)
grave bugs of libdpkg-ruby1.8 (0.3.2 -> 0.3.6+nmu1) <unfixed>
 #585448 - Leaves files open as it scans, resulting in too many open files
   Merged with: 600260
grave bugs of dash (0.5.4-12 -> 0.5.5.1-7.4) <unfixed>
 #540512 - dash upgrade breaks mksh-as-/bin/sh
 #538822 - dash fails to upgrade if /bin/sh is locally diverted
grave bugs of grub-pc ( -> 1.98+20100804-14) <unfixed>
 #593648 - grub-pc install fails on RAID1 (unknown filesystem)
 #590884 - grub-pc: upgrading with vmlinuz-2.6.32-5-amd64 kernel fails on device detection
 #612220 - after update to squeeze grub2 don't load the system
 #620663 - grub-pc hangs after upgrading lenny to squeezy
grave bugs of openssh-client (1:5.1p1-5 -> 1:5.5p1-6) <unfixed>
 #607267 - /usr/bin/scp: fails to notice close() errors
grave bugs of elinks (0.12~pre2.dfsg0-1 -> 0.12~pre5-2) <unfixed>
 #617713 - Caches documents in violation of HTTP spec and general sanity
serious bugs of apt (0.7.20.2+lenny2 -> 0.8.10.3) <unfixed>
 #558784 - apt: re-adds removed keys
serious bugs of python2.5 (2.5.2-15+lenny1 -> 2.5.5-11) <unfixed>
 #598372 - python2.5: uses the network during build
serious bugs of lvm2 (2.02.39-8 -> 2.02.66-5) <unfixed>
 #603036 - lvm2: fails to install due to incorrect dependencies in init.d LSB header
serious bugs of munin (1.2.6-10~lenny2 -> 1.4.5-3) <unfixed>
 #619399 - munin shouldn't recreate apache conf on every update
serious bugs of grub (0.97-47lenny2 -> 0.97-64) <unfixed>
 #594283 - grub: non-standard gcc/g++ used for build (gcc-4.3)
serious bugs of insserv ( -> 1.14.0-2) <unfixed>
 #598020 - barfs when there are "invalid" init scripts

I focused on grub-pc bugs, because I don't have a monitor and having the server unbootable is not my idea of a fun way to spend my afternoon. 612220 and 620663 seemed the gravest ones, so I checked them. The latter seems more complicated, so I just added another bullet in my mouth and continued. The rest of the bugs seemed harmless enough for me. The NEWS had nothing either. The moment of truth was coming closer.
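With a list that long, the triage (look at what is unfixed and grave or critical first) can be done mechanically. The following Python sketch parses the text exactly as it is pasted above; it is not an apt-listbugs API, the format is inferred from this output alone, and parse_listbugs is a hypothetical name.

```python
import re

# "grave bugs of grub-pc ( -> 1.98+20100804-14) <unfixed>"
HEADER = re.compile(
    r"^(?P<severity>\w+) bugs of (?P<package>\S+) "
    r"\((?P<old>.*?) -> (?P<new>.*?)\) <(?P<status>[^>]+)>$"
)
# " #593648 - grub-pc install fails on RAID1 (unknown filesystem)"
BUG = re.compile(r"^\s*#(?P<number>\d+) - (?P<title>.*)$")


def parse_listbugs(text):
    """Return one dict per bug, carrying the fields of its header line."""
    bugs = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = header.groupdict()
            continue
        bug = BUG.match(line)
        if bug and current:
            bugs.append({**current, "number": int(bug["number"]), "title": bug["title"]})
    return bugs


sample = """\
grave bugs of grub-pc ( -> 1.98+20100804-14) <unfixed>
 #593648 - grub-pc install fails on RAID1 (unknown filesystem)
 #612220 - after update to squeeze grub2 don't load the system
serious bugs of wget (1.11.4-2+lenny2 -> 1.12-2.1) <marked as done in some version>
 #614373 - wget: mixes dpatch and 3.0 (quilt) (Fixed: wget/1.12-3)
critical bugs of udev (0.125-7+lenny3 -> 164-3) <unfixed>
 #593083 - udev - system hangs at login screen
"""

worrying = [
    bug for bug in parse_listbugs(sample)
    if bug["status"] == "unfixed" and bug["severity"] in ("grave", "critical")
]
for bug in worrying:
    print(bug["severity"], bug["package"], bug["number"], bug["title"])
```

Lines such as "Merged with:" are simply ignored, and reading the bug reports themselves is still up to you.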

During the Debconf part, I chose to put grub-pc in the boot sector of the root partition and chainload it from grub-legacy which is still installed in the MBR. I also kept the old kernel, just in case.

With a mouthful of bullets in different states of chewedness, I rebooted the server. As expected, nothing happened; that is, it booted fine, no problems, all services still there. Disappointed, now I'm looking intently at a more complicated server for an upgrade.

sysadmin debian


[1] yes, I know dselect is in maintenance mode now, and that it knows nothing about automatic packages, but I mostly know what I need and what not. In any case, what's missing can be installed later. Nothing critical is run in this server.

[2] If you haven't, you really have to see Ahmed, the suicide terrorist

Posted Wed 06 Apr 2011 08:12:16 PM CEST Tags: