Thursday, May 22, 2008

Backup Scheme



Since I'm working on backups today, I thought I'd share a simplified version of how and when my backups get taken care of.

Since we're 99% Linux here, cron jobs take care of everything for us. The cron jobs call shell scripts that determine what day of the week it is, and what needs to be backed up based on that.
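As a rough illustration of that day-of-week dispatch, here is a minimal sketch. It is not the actual script: the job names and the crontab path are made up for the example, and the weekly plan it encodes is the one described further down in this post.

```shell
#!/bin/sh
# Hypothetical crontab entry that would run this nightly (path is an assumption):
#   0 1 * * * /usr/local/bin/nightly-backup.sh

# Map an ISO weekday number (1 = Monday ... 7 = Sunday) to that day's jobs.
# The job names are illustrative labels, not real commands.
jobs_for_day() {
    case "$1" in
        1)       echo "nightly-sync full-usb recent-usb" ;;  # Monday
        2|3|4|5) echo "nightly-sync recent-usb" ;;           # Tuesday to Friday
        6)       echo "nightly-sync full-tape" ;;            # Saturday
        *)       echo "nightly-sync" ;;                      # Sunday
    esac
}
```

A cron-driven script would call it as jobs_for_day "$(date +%u)" and then run whatever each label stands for.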

In order to determine exactly what should be backed up, I wrote a small bit of code that parses a config file. The code is ugly, but here it goes:

grep -Ev '^(#|$)' "$CONFIGFILE" | awk -F: '{print "echo \"Syncing " $2 " on " $1 "\"\ntime rsync -e ssh -az " $4 " " $1 ":" $2 " " $3 ";"}' > "$BINDIR/fsync-$DATECHUNK.sh"


Your mileage may vary ;-)

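Anyway, it just creates a temporary script to execute. It's insecure, since it creates a race condition: anything that can write to that file between the time it's generated and the time it runs gets to decide what the backup job executes. I don't recommend anyone go that route. A much better solution would be a Perl script that queries a database. Much, much better.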
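Short of the Perl-and-database route, one way to sidestep the temporary script is to read the config file in a loop and call rsync directly. This is only a sketch, not what runs here: the field order (host:remote_path:local_dest:extra_options) is inferred from the awk one-liner above, and the RSYNC override and the function name are my own additions for illustration.

```shell
#!/bin/sh
# RSYNC is overridable so a stub can stand in for rsync during a dry run.
RSYNC="${RSYNC:-rsync}"

# Run one rsync per config line instead of generating a script to execute.
# Assumed line format: host:remote_path:local_dest:extra_options
sync_from_config() {
    config_file="$1"
    while IFS=: read -r host remote_path local_dest extra_opts; do
        # Skip blank lines and comments, as the grep filter did.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        echo "Syncing $remote_path on $host"
        # $extra_opts is deliberately unquoted so several options split into words.
        # stdin is redirected so the command cannot swallow the rest of the config.
        "$RSYNC" -e ssh -az $extra_opts "$host:$remote_path" "$local_dest" < /dev/null
    done < "$config_file"
}
```

Nothing is written to disk between parsing and running, so there is no generated file for anyone to tamper with. The config file itself still needs sane permissions, of course.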

Anyway, the wide view is that every night, 365 days of the year, the backup server attached to the XRAID receives the most recent data from the secondary file server (which gets it from the primary file server). On Mondays, we do a full backup from the XRAID to a large external USB drive, and copy the last 30 days' worth of data to a smaller external USB drive. Tuesday through Friday, we just copy the 30 days' worth onto drives. On Saturdays, we do a full backup onto tapes, which are removed and placed in storage.
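For the "last 30 days" part, here is one way it could be done. It assumes that "30 days' worth" means files modified within the last 30 days, which is my reading rather than anything spelled out above, and the function names are made up for the example.

```shell
#!/bin/sh
# List regular files under $1 modified within the last $2 days (default 30),
# as paths relative to $1, one per line.
recent_files() {
    src_dir="$1"
    days="${2:-30}"
    ( cd "$src_dir" && find . -type f -mtime "-$days" -print )
}

# Hypothetical use: copy only those files onto the daily drive.
# Paths containing newlines would need find -print0 with rsync --from0.
copy_recent() {
    src_dir="$1"
    dest_dir="$2"
    recent_files "$src_dir" 30 | rsync -a --files-from=- "$src_dir/" "$dest_dir/"
}
```

Something like copy_recent /path/to/xraid /path/to/daily-drive would then fill the small drive, with both paths standing in for wherever things are really mounted.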

Our reasoning is this: if all of the file servers die, or the storage dies, or what have you, we could get up and running temporarily with the last 30 days' worth of data. Since time would be of the essence, and it's no fun to copy 500GB off of a USB drive, even with USB 2.0, we have each day's most recent files on an easily accessible drive, which can be plugged into any machine immediately. If any one disk drive dies, we have the previous 4 as well.
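To put a rough number on "no fun", here is a back-of-the-envelope calculation. The 30 MB/s sustained rate is an assumption about what a USB 2.0 disk manages in practice. The 60 MB/s figure is just the 480 Mbit/s signaling rate divided by 8, which no real drive reaches.

```shell
#!/bin/sh
# Estimate whole minutes needed to copy size_gb gigabytes at rate_mb_s MB/s.
# Uses 1 GB = 1000 MB and integer arithmetic, so results are truncated.
transfer_minutes() {
    size_gb="$1"
    rate_mb_s="$2"
    echo $(( $size_gb * 1000 / $rate_mb_s / 60 ))
}

transfer_minutes 500 30   # assumed real-world rate: prints 277 (about 4.6 hours)
transfer_minutes 500 60   # theoretical ceiling: prints 138 (about 2.3 hours)
```

Either way it's hours of downtime before anyone can work, which is why the small "recent files" drives exist.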

Since we're still going to want the old data, we have a USB drive with that as well, which can copy in the background while we're working live with the most recent data. This can also be plugged into any machine.

If all 5 daily drives are shot, AND the weekly drive is gone, then A) we've likely got other issues to deal with, and B) we can restore from tape, eventually. It takes quite a while to transfer 500GB off of a tape, so that's the last option, but it's there. It's also important that we have more than one tape drive: one here and one in another location.

It would be a good idea if we sent tapes to the location with the other tape drive, but I don't have that in place yet. It will come soon, though.

The co-location we're looking into has a backup SAN as well, where we can store the 30 days' worth of data. Needing to recover from tape in that case would mean the situation was pretty dire (or that the operations staff hadn't noticed a file had been missing for more than 30 days). That's not likely, given how paranoid they are as a general rule.