Saturday, May 17, 2008

Storage and bandwidth, my two enemies

I have a love / hate relationship with storage. OK, mostly hate, and most of that hate is generated by bandwidth prices. I'll explain...

First, let me give you a rundown of my storage needs, and you'll see what I mean.

At my primary site, which is 2 racks hosted in a co-location facility, I have an NFS server which holds our primary operations data and our Oracle database logs (we do archive log shipping to keep the DBs in sync, on Oracle 8i). It's a directly-attached RAID enclosure with about 1.2TB of usable space. On that array, the operations data consists of more than 1 million files which together take up right around 500GB. There is another 300GB of Oracle logs, and we typically keep a couple of the most recent DB images on there (100GB or so, compressed).

We have enough storage to keep up with demand until I get the new primary site up, which features an AX4-5 fibre channel SAN storage unit. It's going to have a very comfortable 4+ TB of usable space.

As you can see, my major problem isn't having enough storage. It's getting the data to where it belongs.

Between the primary site and the backup site, we have a dedicated circuit. The sites are only 40 miles apart, so it's comparatively cheap. Unfortunately, it's a T1. The delta on the operations data is typically between 10 and 20GB a day. Add to that the DB archive logs. Depending on how much of the database changes, I've seen as little as 5GB, and as much as 115GB a day. All over a T1.
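To put those numbers in perspective, here is a rough back-of-the-envelope calculation of what a T1 can carry in a day. It is only a sketch: it assumes the nominal 1.544 Mbit/s line rate, decimal gigabytes, and a link that is saturated around the clock, and it ignores framing and protocol overhead as well as any savings from rsync's delta transfers or compression, so real numbers will differ.

```python
# Rough ceiling on what a T1 can move in 24 hours.
# Assumptions: nominal line rate, decimal GB, no overhead, no compression.
T1_BITS_PER_SEC = 1_544_000
SECONDS_PER_DAY = 24 * 60 * 60


def gb_per_day(bits_per_sec: float) -> float:
    """Decimal gigabytes a link can carry in a day at full line rate."""
    return bits_per_sec / 8 * SECONDS_PER_DAY / 1e9


def days_to_send(gigabytes: float, bits_per_sec: float) -> float:
    """Days needed to push `gigabytes` through the link at full line rate."""
    return gigabytes / gb_per_day(bits_per_sec)


if __name__ == "__main__":
    print(f"T1 ceiling: {gb_per_day(T1_BITS_PER_SEC):.1f} GB/day")
    # 5 and 115 are the smallest and largest archive-log days mentioned above.
    for load_gb in (5, 20, 115):
        days = days_to_send(load_gb, T1_BITS_PER_SEC)
        print(f"{load_gb:>4} GB -> {days:.1f} days on the wire")
```

Under those assumptions the ceiling works out to a bit under 17GB a day, which is why a big log day spills well past 24 hours.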

On a typical day, we don't have issues transferring everything. It takes a long time, but it gets done. On the bigger days, though, everything takes longer than an entire day to complete, and when that happens, all hell breaks loose.

You think a T1 is slow? You should see a T1 with four concurrent rsyncs happening on it. And they just keep on stacking up, because each concurrent transfer slows the previous ones down that much more.

What's the solution? Well, to be honest, there isn't a great one. Assuming that you are using custom-written scripts (we are), you can create a "lock file" for a transfer, so that if cron calls the script again while the previous run is still going, it sees the file and refuses to run. This, of course, can lead to the backup being 2 days behind, which means that your next sync has double the data to send and takes twice as long as a "normal" sync.
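For illustration, a lock file along those lines might look something like the sketch below. This is not our actual script: the lock path and the `run_transfer` placeholder are made up for the example, and the real rsync call would go where the placeholder is.

```python
import os
import sys
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when a previous run still holds the lock file."""


@contextmanager
def transfer_lock(path):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_CREAT | O_EXCL makes "check and create" one atomic step, so two
        # cron invocations cannot both believe they own the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        raise AlreadyRunning(path) from None
    try:
        with os.fdopen(fd, "w") as lock:
            lock.write(f"{os.getpid()}\n")  # record the owner for debugging
        yield
    finally:
        os.unlink(path)  # release the lock even if the transfer failed


def run_transfer():
    """Hypothetical placeholder for the real rsync invocation."""


if __name__ == "__main__":
    LOCK_PATH = "/tmp/ops-sync.lock"  # hypothetical path, pick your own
    try:
        with transfer_lock(LOCK_PATH):
            run_transfer()
    except AlreadyRunning:
        sys.exit(0)  # previous sync still running; skip this round
```

One caveat with this approach: if the script is killed hard (or the box reboots mid-transfer), the lock file is left behind and every later run will refuse to start until someone removes it.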

What we do is send many (hopefully) small syncs a day, for each piece of data. We also time our cronjobs so as not to interfere with each other. The operations and DB syncs are each separated by several hours. Even when the syncs are small, it takes a while to rsync that much data. Hell, it takes a long time for rsync to build the file list for over a million files in each spot.

The problems I've been describing are a major reason that, in our recent shopping for a new co-location in NJ, I've only been looking at companies that have multiple data centers in the Northeast.

When our contract expires with our current location, I fully intend to put our backup site in one of the other facilities owned by our then-primary site, with a different network provider, of course. The reason for this is that most colo companies have massive (to me, anyway) pipes between their sites. Massive to the point that they can sell me a 100Mbps pipe between my sites for $1000/mo. Compared to my current 1.5Mbps pipe for $600/mo, that's a hell of a deal.
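To see just how lopsided that comparison is, here is a quick sketch of the arithmetic. It reads both pipe sizes as megabits per second (a T1 is roughly 1.5 Mbit/s) and assumes the full rate is actually usable, with decimal gigabytes and no protocol overhead.

```python
# Price per Mbit/s and time to move the worst-case day over each link.
# Assumption: both pipe sizes are megabits per second, fully usable.
LINKS = {
    "current T1": (1.5, 600),        # (Mbit/s, dollars per month)
    "inter-site pipe": (100, 1000),
}


def cost_per_mbps(mbps: float, dollars_per_month: float) -> float:
    """Monthly dollars paid per Mbit/s of capacity."""
    return dollars_per_month / mbps


def hours_to_send(gigabytes: float, mbps: float) -> float:
    """Hours to move `gigabytes` (decimal) at `mbps` with no overhead."""
    megabits = gigabytes * 8000
    return megabits / mbps / 3600


if __name__ == "__main__":
    for name, (mbps, price) in LINKS.items():
        print(
            f"{name}: ${cost_per_mbps(mbps, price):.0f} per Mbit/s per month, "
            f"{hours_to_send(115, mbps):.1f} h to send a 115GB log day"
        )
```

That works out to $400 per Mbit/s on the T1 versus $10 per Mbit/s on the bigger pipe, and the worst-case log day drops from about a week on the wire to a few hours.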

At that point, I won't have to perform stupid scheduling tricks to make sure my data transfers happen when they should.

Building the rsync index will still take 5 minutes, though.
