Tuesday, October 21, 2008

Is RAID 5 a risk with higher drive capacities?

There's a very interesting discussion going on over at ZDNet about RAID 5 and hard drive capacities. The premise of the discussion is that unrecoverable read errors are uncommon, but statistically, we're approaching disk sizes where they will start to matter. Here's a quote from the blog entry:

"SATA drives are commonly specified with an unrecoverable read error rate (URE) of 10^14. Which means that once every 100,000,000,000,000 bits, the disk will very politely tell you that, so sorry, but I really, truly can’t read that sector back to you. One hundred trillion bits is about 12 terabytes. Sound like a lot? Not in 2009."

That would mean a bad block when trying to read. It wouldn't be such a problem, except when it happens while you're rebuilding a RAID array after a drive failure. Drive failures run around 3% per year for each of the first three years; after that, the rates rise quickly, according to the author and the Google disk study he referenced for the numbers.

So the problem becomes a RAID 5 array with a drive failure. Pull the dead disk out, put a new one in, and the array has to rebuild. Statistically, you'll hit an unrecoverable read error about once per 12 TB read, so once an array approaches that size, the average rebuild will fail.
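To put a rough number on that, here's a back-of-the-envelope sketch. It's my own, not the article's, and it assumes a rebuild has to read back every bit on the surviving drives, that read errors are independent, and that the seven-drive array of 2 TB disks is just a hypothetical example:

# Rough sketch: odds that a RAID 5 rebuild reads every bit off the
# surviving drives without hitting an unrecoverable read error (URE).
# Assumes independent errors and that the whole surface must be read.
def rebuild_survival_probability(drive_tb, surviving_drives, ure_per_bit=1e-14):
    bits_to_read = drive_tb * 1e12 * 8 * surviving_drives  # decimal TB -> bits
    return (1 - ure_per_bit) ** bits_to_read

# Example: seven 2 TB drives, one fails; the remaining six (12 TB total)
# must be read in full to rebuild onto the replacement.
p = rebuild_survival_probability(drive_tb=2, surviving_drives=6)
print("Chance of a clean rebuild: %.1f%%" % (p * 100))  # roughly 38%

In other words, with about 12 TB of data sitting behind a single parity drive, you'd expect the rebuild to die partway through more often than not, which is exactly the author's point.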

Commenters have pointed out that the loss of a single block doesn't necessarily mean the array can't finish rebuilding, just that, with no redundancy left, the data in that particular block is gone. With backups, you can restore the affected file and still have a functioning array. I suspect it depends on the controller, but I don't have any data to back that up.

The author argues in favor of more redundant RAID levels. RAID 6 can tolerate the loss of two drives, which means a read error hit during a single-drive rebuild can still be reconstructed from the second parity, and other RAID levels can survive even more failures, depending on which drives die.

Just the other day, I had a RAID 0 fail, but that was from the controller dying. Have you ever had an array die during rebuild? How traumatic was it, and did you have a backup available to recover?

Also, if you could use a refresher, I wrote about the various RAID levels a while back.