Friday, February 6, 2009

Backups Suck

Many thanks to Michael Janke for this blog entry

Years ago we had a period of time where we had nothing but problems with backups. Tape drives failed, changers failed, jobs failed. We rarely if ever went a weekend without getting called in to tweak, repair or restart backup hardware or software. It was a pain. One of the many times that the hardware vendor was on site replacing drives or changer parts, I asked the tech:

"Does everyone hate backups as much as I do?"

The answer:


So backups suck, but like it or not they are an essential part, and perhaps the most essential part, of system administration. If you can't recover from failure, then you are not a system manager or system administrator. Change your title. The one you've been using isn't appropriate.

Here are my thoughts on backups.

Key concepts:

RPO and RTO. If you don't know what they mean, start googling. If you know what they mean, then you should also know what they are for each of your applications. If you have no formal SLAs covering recovery, you should at least have informal agreements between you, your managers and your customers as to what the expectations are for recovery points and recovery times for storage, server and site failures. If you don't know what the expected RPO and RTO are for your applications, you've got a problem. You can't really make backup and recovery decisions without at least some idea of what they might be. At the very least, make up an RPO and RTO and let your boss know what they are. A little CYA doesn't hurt.
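
If it helps to see RPO and RTO as numbers rather than acronyms, here is a minimal sketch in Python. The application name, the 24 hour RPO, the 4 hour RTO and the timestamps are all made up for illustration; plug in whatever you and your customers have agreed to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjective:
    """Agreed recovery targets for one application (values here are hypothetical)."""
    app: str
    rpo: timedelta  # maximum tolerable data loss, measured as time
    rto: timedelta  # maximum tolerable time to restore service


def meets_rpo(objective, last_recoverable_point, now):
    """True if a failure right now would lose no more data than the RPO allows."""
    return now - last_recoverable_point <= objective.rpo


def meets_rto(objective, measured_restore_time):
    """True if the last timed test restore finished within the RTO."""
    return measured_restore_time <= objective.rto


payroll = RecoveryObjective("payroll", rpo=timedelta(hours=24), rto=timedelta(hours=4))
now = datetime(2009, 2, 6, 9, 0)
last_backup = datetime(2009, 2, 5, 23, 0)  # last night's backup

# 10 hours of exposure against a 24 hour RPO
rpo_ok = meets_rpo(payroll, last_backup, now)
# a 6 hour test restore against a 4 hour RTO
rto_ok = meets_rto(payroll, timedelta(hours=6))
print(rpo_ok, rto_ok)
```

In this made-up case the recovery point is fine but the recovery time is not, which is exactly the kind of thing you want to find out before the failure.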

Backup versus Archive. You can Google that phrase ('backup versus archive') and get good definitions. The way I define them, a backup exists to permit recovery from system failure to the most recent recoverable point in time in a manner that meets recovery point and recovery time objectives. An archive exists to permit recovery to points in time other than the most recent recoverable point in time. By those definitions, any backup older than the most recent backup is an archive. In general, backups protect you from physical failures. Archives protect you from logical failures.

You have a valid backup when….

1. The backup is on separate spindles and controllers from the source data.
2. The backup is off site.
3. The backup is tested by successfully restoring data.

If it isn't on separate controllers and spindles then it's not a backup. It might be a copy of the data that protects you against certain failure modes, but it's not a backup. RAID 1, 5, 6, 10, 0+1, 1+0, or whatever are not substitutes for backups. Controllers fail. I've personally experienced a handful of controller failures that resulted in scrambled data. The failed controller will scramble both halves of the mirror and all of the RAID set. So a database dump to a LUN on the same SAN as the database isn't a backup until it is swept off to tape or copied to a disk pool on some other controller/spindles.

If it isn't off site, it's not a backup. If you have a stack of tapes that will get sent off to the data vault when the company that does that shows up at 10am Monday, those tapes will be a valid backup at 10:05am Monday. Until then, they are a copy of your data, not a valid backup.

If you haven't tested it, it's not a backup. Think about that. One of the things I've done to drive home the importance of backups is to walk up to a sysadmin's cube and ask them to delete their home directory. I'm the boss, I can do that. Trust me, it's fun. :-) If they hesitate, I know right away that they don't have confidence in their backups. That's bad – for them, for me, and for our customers.
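
The three tests above boil down to a checklist. A minimal sketch in Python, where the BackupCopy record and its field names are hypothetical and not taken from any backup product:

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    """One copy of the data, described by the three tests (hypothetical record)."""
    name: str
    separate_hardware: bool  # different spindles and controllers than the source
    off_site: bool           # physically away from the source site
    restore_tested: bool     # data has been successfully restored from it


def missing_criteria(copy):
    """List the reasons this copy does not yet count as a valid backup."""
    problems = []
    if not copy.separate_hardware:
        problems.append("shares spindles or controllers with the source")
    if not copy.off_site:
        problems.append("is not off site")
    if not copy.restore_tested:
        problems.append("has never been restored")
    return problems


def is_valid_backup(copy):
    """A copy is a valid backup only when all three tests pass."""
    return not missing_criteria(copy)


dump = BackupCopy("db dump on the same SAN", False, False, True)
tapes = BackupCopy("tapes waiting for Monday pickup", True, False, True)
vaulted = BackupCopy("vaulted tapes, restore tested", True, True, True)

for copy in (dump, tapes, vaulted):
    print(copy.name, is_valid_backup(copy), missing_criteria(copy))
```

Only the vaulted, tested tapes pass. The other two are copies of the data, not backups.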

A backup is not an archive, and an archive is not a backup. In my world, an archive permits recovery to points in time other than the most recent recoverable point in time. Perhaps because of a regulatory requirement, you need to be able to recover files or databases as they were a month, a year or a decade ago. Then you need archives. If you don't have regulatory or other retention requirements, an archive still protects you against 'logical failures'. For example an archive provides protection against file deletion or corruption that went undetected for a period of time, or protection against accidental or intentional deletion or destruction of data.

But you likely can design a system where backups and archives use the same hardware and software, and in many cases, a backup can become an archive. In the common Grandfather-Father-Son (GFS) tape rotation, the full backups become archives as soon as the next full backup is finished. At that point in time, the full backup is no longer protecting you against server, storage or site failure. It's too old for that. But it is still protecting you against logical failure (a file or database that got corrupted or deleted, but went undetected for a period of time.)
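
Using my definitions, deciding which full is the backup and which are archives is mechanical. A minimal sketch in Python, assuming weekly fulls on made-up dates and ignoring incrementals entirely:

```python
from datetime import date


def classify_fulls(full_backup_dates):
    """Label the newest completed full 'backup' and every older full 'archive'."""
    ordered = sorted(full_backup_dates)
    roles = {d: "archive" for d in ordered}
    if ordered:
        roles[ordered[-1]] = "backup"
    return roles


# Hypothetical weekly fulls
fulls = [date(2009, 1, 17), date(2009, 1, 24), date(2009, 1, 31)]
roles = classify_fulls(fulls)
for day, role in roles.items():
    print(day, role)

# The moment next weekend's full finishes, the January 31 full becomes an archive.
roles_next_week = classify_fulls(fulls + [date(2009, 2, 7)])
```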

Snapshots, Replication and Log Shipping.

Vendors are more than happy to sell us tools and toys that solve all our problems, but do they really? When do snapshots and various replication strategies protect us against physical and logical failures?

It depends.

We replicate some data (actually 25 million files) to an off site location using an OEM'd version of Doubletake. The target of the replication is a fully configured cluster on separate SAN controllers miles away from the source. That copy of the data protects us against site, storage and server failure (it's our backup). But when a customer hits the 'press here to delete a half-million files' button that the software vendor so graciously provided (logical failure), the deletes get replicated in a couple of seconds. The off site replica doesn't help. Those files are recovered from an archive (last night's incremental + last weekend's full), not from the backup (the real time replica).

Another example is the classic case where a user or DBA deletes a large number of rows from a table or does the old 'DROP TABLE' trick. If you've configured log shipping or some other database replication tool to protect yourself against site, server or storage failure, you'll replicate the logical failure (the deletion or drop) faster than you can blink, and your replicas will also be toasted. The replication technology will replicate the good, the bad and the ugly. It doesn't know the difference. You need to be able to perform a point-in-time recovery to a point before the damage was done, and replication alone doesn't provide that. Transaction logs, archive logs and similar technologies provide the point-in-time recovery.
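
A toy model makes the difference concrete. This is a sketch in Python, not how any real database ships or replays logs: the key names, the operation format and the sequence numbers are all invented for illustration.

```python
def apply_op(state, op):
    """Apply one logged operation to a copy of the state and return the copy."""
    kind, key, value = op
    new_state = dict(state)
    if kind == "put":
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(key, None)
    return new_state


def replay(base, log, upto):
    """Rebuild state from a base copy by replaying the log through sequence number upto."""
    state = dict(base)
    for seq, op in log:
        if seq > upto:
            break
        state = apply_op(state, op)
    return state


base = {}  # state captured by the last full backup (empty in this toy example)
log = [
    (1, ("put", "invoice-1", "paid")),
    (2, ("put", "invoice-2", "open")),
    (3, ("delete", "invoice-1", None)),  # the logical failure starts here
    (4, ("delete", "invoice-2", None)),
]

primary = replay(base, log, upto=4)
replica = replay(base, log, upto=4)    # replication applies everything, deletes included
recovered = replay(base, log, upto=2)  # point-in-time recovery stops before the damage

print(replica)
print(recovered)
```

The replica ends up exactly as empty as the primary. Only the replay that stops at a chosen point gets the data back, and that requires keeping both the base copy and the log.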

Snapshots tend to complement replication. In general, a snapshot of a disk that is stored on the same controllers protects you against logical failure (it's an archive), but not against site, server or storage failure (it's not a backup). The snap gives you a point in time that is recoverable against logical failure, but not physical failure.

Whatever you have for a backup and archive system, keep in mind

* physical and logical failure
* recovery point and recovery time

And make sure you understand how you will recover from each of the failure modes within the recovery time to the recovery point.

Then – because I teach at a local college, I get to give you all an assignment. It's got two parts:

1. Delete your home directory
2. Recover it from backup

Let me know how you did.
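
If you want to grade yourself, compare checksums from before the delete with checksums after the restore. Here is a minimal sketch in Python that rehearses the whole thing in a throwaway temporary directory, so nothing real gets deleted. The shutil.copytree call is only a stand-in for whatever your real backup and restore jobs are, and the file names are made up.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def manifest(root):
    """Map each file's path relative to root to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_differences(before, after):
    """Return the paths that are missing or changed after the restore."""
    return sorted(path for path, digest in before.items() if after.get(path) != digest)


with tempfile.TemporaryDirectory() as scratch:
    home = Path(scratch) / "home"      # a fake home directory, not your real one
    backup = Path(scratch) / "backup"
    (home / "docs").mkdir(parents=True)
    (home / "docs" / "notes.txt").write_text("RPO and RTO\n")
    (home / ".profile").write_text("export EDITOR=vi\n")

    before = manifest(home)
    shutil.copytree(home, backup)  # stand-in for the real backup job
    shutil.rmtree(home)            # part 1: delete
    shutil.copytree(backup, home)  # part 2: recover
    differences = restore_differences(before, manifest(home))

print(differences)
```

An empty list means every file came back with the same contents. Anything else is a list of files you would have lost.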

Michael Janke Last In, First Out