Wednesday, October 15, 2008

When I asked for "normal", this is not what I meant!

Of course, just as I'm settling into my day yesterday, I get a message from one of the operations people: "the primary FTP site isn't working".

I log in and check it, and he's right. The port is open, but no one's home. I ssh'd in fine, the daemon is running, and everything looks normal. Of course I checked the logs at that point, and there wasn't an error. In fact, there wasn't a single log entry past 12:40am.

I've had things like this happen before, and when even the logging dies, your disk is probably full. df -h said this:


# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda4 34G 3.9G 30G 12% /
/dev/sda2 1.9G 50M 1.9G 3% /var
/dev/sda3 1.9G 68M 1.9G 4% /tmp


Everything looks fine there, but something was obviously wrong. Time to check /var/log/messages:

Oct 14 00:40:49 bv kernel: sd 0:0:0:0: scsi : Device offlined - not ready after error recovery
Oct 14 00:40:49 bv kernel: sd 0:0:0:0: scsi : Device offlined - not ready after error recovery
Oct 14 00:40:49 bv kernel: sd 0:0:0:0: SCSI error : return code = 0x00000002
Oct 14 00:40:49 bv kernel: sd 0:0:0:0: SCSI error : return code = 0x00000002
Oct 14 00:40:49 bv kernel: sd 0:0:0:0: SCSI error : return code = 0x00010000
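
When a log is full of lines like these, it can help to boil them down to the list of devices the kernel has given up on. The following is only a sketch: the function name offlined_devices is made up for illustration, and it assumes the log lines look exactly like the excerpt above ("kernel: sd host:channel:id:lun: ... Device offlined").

```shell
# Print each SCSI address (host:channel:id:lun) that logged "Device offlined".
# Reads the file given as $1, or /var/log/messages if no argument is passed.
# Assumes the line format shown in the excerpt above; other kernels may differ.
offlined_devices() {
    grep 'Device offlined' "${1:-/var/log/messages}" |
        sed -n 's/.*kernel: sd \([0-9:]*\): .*/\1/p' |
        sort -u
}
```

Run against the excerpt above, this would print the single address 0:0:0:0.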

Well, hell. That's not a good thing at all. Looking at the errors on the console, I determined that the reiserfs driver (these are older Slackware machines) had dispensed with the journal, and the drives were now read-only.
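
Part of why df -h was no help here is that it reports space, not whether a filesystem is writable. On Linux, a filesystem that the kernel has flipped to read-only will typically show "ro" among its options in /proc/mounts, so a rough check could look like the sketch below. The function name ro_mounts is hypothetical, and it assumes the usual mounts layout (device, mount point, type, options, ...).

```shell
# List mount points whose option list contains "ro".
# Reads the mounts-format file given as $1, or /proc/mounts by default.
# Field 2 is the mount point and field 4 the comma-separated options.
ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "${1:-/proc/mounts}"
}
```

Calling ro_mounts with no arguments checks the live system. If it prints / or /var on a box that is supposed to be writable, that would explain logging going quiet. Actually trying to create a file on the suspect filesystem is the more direct test.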

This is the reason you have a backup server.

Remarkably, despite not being able to write to the disk at all, the client website still ran fine, since it's just an Apache instance that forwards to an internal Tomcat server. That bought us some time.

I verified that the backup was prepared, and coordinated with the operations people to swap the site to the new machine. We got the service swapped over, and I quickly configured a spare machine to act as the backup of the now-running production machine, and then took a look at the broken box.

As it turns out, the controller crapped out. The box still won't boot at all, and the controller says the drives are "incomplete", whatever that means. On top of that, I can't even get into the controller BIOS to reinitialize the array. Great.

So that's how yesterday went, and why there wasn't an entry for it :-) Hope yours went better.