Friday, May 30, 2008

Sign you use vi too often


bandman@newcastle[501]:~$ alias
alias :q!='exit'
alias ls='ls --color=auto'
bandman@newcastle[502]:~$

Thursday, May 29, 2008

You're not just a systems administrator

In my younger days as an assistant network administrator, I would always be assisting my superior in retooling some part of the network. Upgrading or reinstalling a system here, rewiring a switch there. I always felt like we were doing these things for nothing. I mean, here we were working when we could be doing nothing, or even better, playing games on the LAN.

I always followed along, and gradually started doing it on my own, too. I learned the meaning of the old saying "a stitch in time saves nine". It's a good thing, because it wouldn't be too long before I was responsible for the network myself.

A year or so later, I was fixing something that broke, and an idea dawned on me. My role was more than running the network; it was actually being part of the network. Sure, I administered it, built it out, and repaired it, but that's the point, really. By performing these actions, you cease to be a separate entity. You're part of the system.

When that occurred to me, I realized that my stated goal within the system was to provide homeostasis. A computer network is not a biological organism, and therefore cannot (yet) provide for its own regulation. That's where we come in.

I don't make much distinction between systems and the network. Sure, in most places, there are systems administrators and network administrators. Sys admins fix operating systems, software installations, and computer hardware issues, and network administrators fix the parts that connect everyone together, but they're both part of the bigger picture.

The infrastructure is, for all intents and purposes, a non-biological entity. It can get sick, parts can fail, and it requires action to return to balance. We're the homeostasis. We're the part that corrects imbalances, and if the system is large and complex enough, we specialize into teams, each dedicated to keeping the entity living, breathing, and healthy.

Food for thought the next time a hard drive dies at 3am.

Wednesday, May 28, 2008

Dell Blade Enclosure issues [updated]

As I write this, I'm on hold with Dell tech support.

It seems that my blade enclosure's DRAC card has somehow become detached from the Avocent KVM card. The general configuration screen shows the card, but none of the management interfaces are able to reach it.

Apparently there's a reconfiguration necessary through OpenManage, which I haven't set up yet, so we're installing it now on my XP VM. Hopefully this will reconnect the two feuding devices.

[UPDATE]

2 hours later, the problem has been resolved. I had to issue racadm commands from the serial console.
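For anyone who lands in the same spot, it was roughly this sort of thing (the addresses here are made up, and the exact racadm syntax varies by DRAC firmware, so check Dell's documentation rather than trusting my memory):

# from the enclosure's serial console -- addresses invented for the example
racadm getniccfg
racadm setniccfg -s 10.10.50.20 255.255.255.0 10.10.50.1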

The lesson here is: if you're going to change the subnet your blade enclosure is on, change the KVM module's IP first. Otherwise, everything goes to hell.

Tuesday, May 27, 2008

Once more unto the breach

I'm fighting with centralized authentication again.

I still haven't decided whether I'm going to make a half-hearted attempt at reliably authenticating Linux against Active Directory or against OpenLDAP (or one of any number of other LDAP servers that people seem to hurl as suggestions). I suspect my problem isn't server-software related. More likely it's a disconnect in my mental model of how a password stored in a directory server allows an account to log in on a Linux box. Where are the user's data files stored? How is that mapped to the account?

I'm sure that once I grasp this, I'll realize that it's more than just two pieces of software. I'm sure it will be an array of software all working together smoothly when properly configured.
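For my own notes, my current (possibly wrong) mental model is that it's really three pieces working together: the directory just stores the account attributes, NSS (via /etc/nsswitch.conf) maps those into the usual passwd/group lookups so Linux knows the UID, home directory, and shell, and PAM (pam_ldap or winbind or whatever) performs the actual password check against the server. The user's files live wherever homeDirectory points, which on a real network usually means NFS or an automounter. A rough sketch, with made-up values:

# /etc/nsswitch.conf -- let glibc resolve accounts from LDAP as well as local files
passwd: files ldap
group:  files ldap
shadow: files ldap

# and the account itself in the directory (LDIF), using the posixAccount schema
dn: uid=jsmith,ou=People,dc=example,dc=com
objectClass: account
objectClass: posixAccount
cn: Joe Smith
uid: jsmith
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/jsmith
loginShell: /bin/bash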

It almost reminds me of my disconnect years ago when learning subnetting. I read, and read, and did the math, and I couldn't get it. I could do the math, I could understand how subnetting worked, but for weeks it eluded my efforts to master it. Something was disconnected in my brain.

On the way to my very first Cisco class, it clicked. What I was missing was not HOW to subnet. I had mastered that. I was missing the WHY of subnetting. The splitting of whole networks into smaller ones, and the aggregation of multiple networks for routing, all finally locked firmly into place. In the span of about 3 seconds, I went from not being able to subnet to being able to teach others.
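To put that WHY in concrete terms: take one /24 and carve it into four smaller broadcast domains, or go the other direction and summarize four contiguous /24s into a single route.

192.168.10.0/24  = 256 addresses, one big broadcast domain

borrow two host bits and you get four /26s:
  192.168.10.0/26     hosts .1   - .62
  192.168.10.64/26    hosts .65  - .126
  192.168.10.128/26   hosts .129 - .190
  192.168.10.192/26   hosts .193 - .254

going the other way, 192.168.8.0/24 through 192.168.11.0/24 can all be
advertised as the single summary route 192.168.8.0/22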

I'm hoping at some point that I "get" this, or that someone utters the magic words that make me snap out of the rut I'm in, so that I can put this behind me like so many other skills that I have wrestled with.

Friday, May 23, 2008

The Tao of Systems Administration

Here's a link to a very insightful post from Last In - First Out regarding the best practices of systems management.

It deals with Ad-Hoc Management versus Structured Management, and it's full of outstanding advice.

Read it for yourself.

Too many times, when we're the only person responsible for this sort of thing, we overlook the "proper" way of doing things because of lack of time, lack of motivation, or lack of knowledge. There's no shame in any of those; they're all hurdles to be cleared on the way to a more reliable infrastructure. Even if you're the only one who creates accounts, and it's going to take a chunk of your time to develop the tools to administer accounts across your network, that's time well invested. If you're lucky enough to someday get more help, you want a structure in place for that new person to follow, and in the end, even if it's just you, having these tools will let you use your time more effectively.
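As a trivial example of what I mean by "tools": even a dumb wrapper that pushes the same account out to every box beats typing useradd twenty times. Something like this sketch (hostnames invented, and a real version would want error checking and consistent UIDs across machines):

#!/bin/sh
# add-user.sh -- create the same account on every server in the list
NEWUSER=$1
HOSTS="web1 web2 db1 backup1"

for h in $HOSTS; do
    echo "Creating $NEWUSER on $h"
    ssh root@"$h" "useradd -m $NEWUSER"
done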

I certainly can't claim to have instituted all these suggestions, but it's definitely something to work towards.

Thursday, May 22, 2008

Twitter

I found this today: http://twitter.com/towerbridge

In case you're too lazy to check it out, the Tower Bridge in London sends a Twitter update every time it opens or closes, with the name of the boat it's opening for. Wow.

I wonder how long it will be before SNMP traps are replaced by Twitter messages from servers.

"I am now shutting down due to: processor meltdown"

Backup Scheme



Since I'm working on backups today, I thought I'd share a simplified version of how and when my backups get taken care of.

Since we're 99% Linux here, cron jobs take care of everything for us. The cron jobs call shell scripts that determine what day of the week it is, and what needs to be backed up based on that.
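Stripped way down, the day-of-week branching is nothing fancier than this (script names invented for the sketch):

#!/bin/sh
# called from cron every night
DOW=$(date +%a)

/usr/local/bin/nightly-sync.sh              # every night: pull the latest data

case "$DOW" in
    Mon) /usr/local/bin/full-to-usb.sh ;;   # Mondays: full copy to the big USB drive
    Sat) /usr/local/bin/full-to-tape.sh ;;  # Saturdays: full backup to tape
esac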

In order to determine exactly what should be backed up, I wrote a small bit of code that parses a config file. The code is ugly, but here it goes:

grep -v '^#' "$CONFIGFILE" | grep -v '^$' | awk -F: '{print "echo \"Syncing " $2 " on " $1 "\"\ntime rsync -e ssh -az " $4 " " $1 ":" $2 " " $3 ";"}' > "$BINDIR/fsync-$DATECHUNK.sh"


Your mileage may vary ;-)

Anyway, it just creates a temporary script to execute. It's insecure, since it creates a race condition, and I don't recommend anyone go that route. A much better solution would be a perl script that queries a database. Much much better.
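For context, the config file it chews on is just colon-delimited lines of host:remote-path:local-destination:extra-rsync-options, and the throwaway script it spits out is nothing but echo/rsync pairs. Roughly like this (hostnames and paths invented for the example):

# one sync job per line:  host:remote_path:local_dest:extra_rsync_opts
fileserv1:/srv/operations:/backup/operations:--delete

# ...which the one-liner turns into:
echo "Syncing /srv/operations on fileserv1"
time rsync -e ssh -az --delete fileserv1:/srv/operations /backup/operations;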

Anyway, the wide view is that every night, 365 days a year, the backup server attached to the XRAID receives the most recent data from the secondary file server (which gets it from the primary file server). On Mondays, we do a full backup from the XRAID to a large external USB drive, and the last 30 days' worth of data to a smaller external USB drive. Tuesday through Friday, we just do the 30 days onto drives. Saturday we do a full backup onto tapes, which are removed and placed in storage.

Our reasoning is this: if all of the file servers die, or the storage dies, or what have you, we could get up and running temporarily with the last 30 days' worth of data. Since time would be of the essence, and it's no fun to copy 500GB off of a USB drive, even with USB 2.0, we have each day's most recent files on an easily accessible drive, which can be plugged into any machine immediately. If any one daily drive dies, we have the previous 4 as well.

Since we're still going to want the old data, we have a USB drive with that as well, which can copy in the background while we're working live with the most recent data. This can also be plugged into any machine.

If all 5 daily drives are shot, AND the weekly drive is gone, then A) we've likely got other issues to deal with, and B) We can restore from tape, eventually. It takes quite a while to transfer 500GB off of a tape, so that's the last option, but it's there. It's also important that we have more than one tape drive, one here and one in another location.

It would be a good idea if we sent tapes to the location with the other tape drive, but I don't have that in place yet. It will come soon, though.

The co-location we're looking into has a backup SAN as well, where we can store the 30 days' worth of data. To need to recover from a tape in that case would require the situation to be pretty dire (or for the operations staff to not realize that a file has been missing for > 30 days). Not likely given how paranoid they are, as a general rule.


Tuesday, May 20, 2008

Environmental Issues in the datacenter

Well, my backup datacenter suffered a little setback yesterday.

Around 5pm, the primary AC tripped the in-line circuit breaker. A couple of hours later, the ambient temperature was right around 95F.

I got the nagios alert at 7:10, and it takes me between half an hour and 45 minutes to get to the office. I got here at 7:45 and shut down everything that wasn't absolutely critical, got the backup AC running, and then realized what had happened to the primary. After getting that fixed, I concentrated on the disk arrays (we've got an XRAID in the rack, and I'm working on setting up the AX4-5, which is still on the table). The blade enclosure shut itself down half an hour before I got there.

I'm going to be discussing filing an insurance claim with my boss to replace the drives in the array. I don't think I can trust them to go into the primary site now.

Any of you have this sort of problem? What do you do to help prevent it, or to recover from it?



[Update: I found this blog entry today, completely by accident. Irony, thy name is Everything Sysadmin]


Saturday, May 17, 2008

Storage and bandwidth, my two enemies

I have a love / hate relationship with storage. OK, mostly hate, and most of that hate is generated by bandwidth prices. I'll explain...

First, let me give you a rundown of my storage needs, and you'll see what I mean.

At my primary site, which is 2 racks hosted in a co-location, I have an NFS server which holds our primary operations data and our Oracle database logs (we do archive log shipping to keep the DBs in sync, on Oracle 8i). It's a directly attached RAID enclosure with about 1.2TB of usable space. On that array, the operations data consists of more than 1 million files which together take up right around 500GB. There is another 300GB of Oracle logs, and we typically keep a couple of the most recent DB images on there (100GB or so, compressed).

We have enough storage to keep up with demand until I get the new primary site up, which features an AX4-5 fibre channel SAN storage unit. It's going to have a very comfortable 4+ TB of usable space.

As you can see, my major problem isn't having enough storage. It's getting the data to where it belongs.

Between the primary site and the backup site, we have a dedicated circuit. The sites are only 40 miles apart, so it's comparatively cheap. Unfortunately, it's a T1. The delta on the operations data is typically between 10 and 20GB a day. Add to that the DB archive logs. Depending on how much of the database changes, I've seen as little as 5GB, and as much as 115GB a day. All over a T1.
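Do the back-of-the-envelope math on what a T1 can actually move and the problem is obvious:

# best case for a 1.544Mbit/s T1, ignoring every last bit of overhead
$ echo "scale=1; 1.544 * 86400 / 8" | bc
16675.2

That's megabytes per day, so the ceiling is roughly 16GB a day. A 115GB day of archive logs is most of a week of line time all by itself, before rsync's compression and delta transfers claw some of that back.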

On a typical day, we don't have issues transferring everything. It takes a long time, but it gets done. On the bigger days, though, everything takes longer than an entire day to complete, and when that happens, all hell breaks loose.

You think a T1 is slow? You should see a T1 with four concurrent rsyncs happening on it. And they just keep on stacking up, because each concurrent transfer slows the previous ones down that much more.

What's the solution? Well, to be honest, there isn't a great one. Assuming that you are using custom-written scripts (we are), you can create a "lock file" for a transfer, so if cron calls that script again while a sync is still running, it sees the file and refuses to run. This, of course, can lead to the backup being 2 days behind, which means that your next sync has double the data to send and takes twice as long as a "normal" sync.
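A minimal sketch of the lock-file approach, in case it's useful to anyone (the paths are made up, and a fancier version would detect and clean up stale locks):

#!/bin/sh
# ops-sync.sh -- refuse to start if a previous sync is still running
LOCKDIR=/var/lock/ops-sync

if mkdir "$LOCKDIR" 2>/dev/null; then
    trap 'rmdir "$LOCKDIR"' EXIT
    time rsync -az -e ssh fileserv1:/srv/operations /backup/operations
else
    echo "previous ops sync still running, skipping this run" >&2
    exit 1
fi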

What we do is send many (hopefully) small syncs a day, for each piece of data. We also time our cronjobs so as not to interfere with each other. The operations and DB syncs are each separated by several hours. Even being small, it takes a while to rsync that much data. Hell, it takes a long time for rsync to build the file list for over a million files in each spot.

The problems I've been describing are a major reason that, in our recent shopping for a new co-location in NJ, I've only been looking for companies that have multiple data centers in the North-East area.

When our contract expires with our current location, I fully intend to put our backup site in one of the other facilities owned by our then-primary site, with a different network provider, of course. The reason for this is that most colo companies have massive (to me, anyway) pipes between their sites. Massive to the point that they can sell me a 100Mbps pipe between my sites for $1000/mo. Compared to my current 1.5Mbps T1 for $600/mo, that's a hell of a deal.

At that point, I won't have to perform stupid scheduling tricks to make sure my data transfers happen when they should.

Building the rsync index will still take 5 minutes, though.


Sent via BlackBerry by AT&T

Friday, May 16, 2008

Sharing your keyboard and mouse

Because I have to administer a few Apple XServers, I need to keep a Mac around. Their server administration tools are (to my knowledge) closed source, and the command line tools are atrocious. We also have an XRAID, which I don't believe HAS any command line tools with which to administer it. So I need a Mac.


To this end, I have a 17" antique Powerbook that I keep around. It's big and clunky, and with only a gig of memory running Tiger, it's none too fast. I keep it around because I have to use the Admin Tools, and that's about it. It takes up a lot of room on my desk, and having to turn constantly gets irritating, so a while back, I found a Better Way(tm).


Enter Synergy. It's a client/server program that allows one computer's keyboard and mouse to act as the inputs for other machines. Sort of like a software-based KVM, without the V.


On my desktop, I type synergys, which starts the Synergy server. On the client machine, I type "synergyc newcastle". You must specify the server; a single Synergy server can drive multiple clients at once, which would be fantastic for a NOC team with shared computers.


The configuration is pretty simple. In Windows, there's a GUI, but in Unix, it's a simple config file.


Here's the content of mine:


bandman@newcastle[501]:~$ cat /etc/synergy.conf

section: screens
    newcastle:
    guiness.local:
    harp:
end

section: links
    newcastle:
        left = guiness.local
        right = harp
    guiness.local:
        right = newcastle
        left = harp
    harp:
        right = guiness.local
        left = newcastle
end

section: aliases
    harp:
        harp.int.ia
end

It's pretty straightforward. In my example above, I have three machines. Newcastle is my Linux machine, and the keyboard/mouse host. To the (physical) right of it is guinness, the Mac. To the left of newcastle is (sometimes) another laptop called harp.

When I move my mouse off the right side of guinness, I want it to wrap to the left side of harp (or to newcastle, if harp is gone), so I specify that harp is to the right of guinness, and vice versa on the left.

Anyway, if you find yourself with too many computers and not enough inputs, give Synergy a shot. It's a great piece of software that I've given a couple of my users, and they love it.

Thursday, May 15, 2008

Virtualization

I think I have just about got the hang of virtualization. Not completely, mind you, but enough to be useful (or dangerous).



I'm implementing VMware Server for a few things in my environment. I don't have the need (or the budget) for ESX Server, but I can definitely see why and where it would be useful. Combine virtual servers with the power of clustering and you have nearly the perfect solution for 100% uptime. That is, of course, assuming your physical site doesn't go down. You do have a backup location, right?

What I haven't gotten a handle on yet, mostly due to lack of time, is storage virtualization. I don't really get how storage could be much more "virtualized" than the LUNs I use on my SAN. I'll read about it the first chance I get, or one of you could enlighten me, and then I'll post on it. Either way.



I have heard of "virtual tape libraries", which appear to be nothing but disk arrays that pretend to be tape drives to the underlying system. I guess if you already had an extensive system in place for dealing with a tape library, it would be handy. Me, I would just edit a few shell scripts and be done with it.

Sent via BlackBerry by AT&T

Time Management

If you're at all like me, you probably have the ever-so-infrequent period of inactivity surrounded by weeks and months of chaos. When you're the only one on your team, and you've got a dozen things to do, how do you manage your time effectively?



At one point in time, I had a spreadsheet printed out of things I had to do every day, and I checked them off as I went. Of course, if I checked them all off and accomplished every task, I didn't have time to do the things that only pop up from time to time, or to take care of the emergencies, so of course I juggled it a bit. This system definitely left some things to be desired. For one, I had to remember to check to see what I had to do next, and there was no way of alerting myself that the next task needed to be done.



I'm going to be investigating some different methods of creating reminders for myself and auto-alerts. I'll make a blog entry whenever I settle on one, or when I give up, whichever comes first.



Just as important as managing your time is tracking where you spend it. Occasionally evaluating where your time actually goes allows you to concentrate your efforts, as opposed to applying band-aid after band-aid, which only takes up your time and irritates you.



I use Gnome Time Tracker, which looks like this:

[screenshot of the Gnome Time Tracker window]
It allows me to create multiple activity categories, and then whenever I work on one specific category, I can add a "diary entry" of what I was working on, the most recent of which appears at the right. I use Ubuntu on my desktop, but your distribution should have it in its repositories, or you can build it from source, of course.



What do you use to both manage and track your time?



Wednesday, May 14, 2008

Introduction and Welcome

Hi. If you've stumbled on this blog, it was either by mistake, or you're interested in the rough and tumble world of systems administration. God help you if it's the latter.

A good first entry should probably cover who I am, and what my background is. I've been a sysadmin for 7 years now, and with the exception of 3 months after an acquisition, I've never been the administrator of a company which employed more than 20 people.

My first position was as the assistant network administrator of a large dial-up ISP in West Virginia. This was in 2000 or so, and dialup had dropped off most people's radar. Since we were in WV, we had a bit of a reprieve, since so many of our customers couldn't get broadband. I eventually became the network administrator when the previous guy left to work for the government. I stayed until a bit after we got acquired by our bandwidth provider. I presided over the technical aspects of the merger, then I left after the culture went downhill and they wanted to move my job.

I evacuated WV to come to Columbus, OH, where I took a short-term job as a PHP programmer, which paid the bills until I could find an admin position. That patience led me to my current company, a financial services firm headquartered just outside of New York City. I work outside of Columbus in what was the primary datacenter. This summer, I'll be relocating to NJ to join the corporate collective. Resistance is futile.

There's my background. How does that qualify me to write this blog?

I know what it's like. I know how it is, being in charge of all the computers, whether they're many or few, and being tasked with making them work. At the ISP, we bought maybe 3 new servers the entire time I was there. We inherited hand-me-downs from a partner company that went under, which allowed me to use rackmount servers for the first time. Prior to that, we had 15,000 dial-up accounts authenticating against a 3-year-old tower "server" running Windows NT4. With no spare. I feel your pain.

I'm very fortunate in that the company that I'm now with allocates money to the IT dept (which consists of me, and my boss, the CTO). When I got here, we had mostly antique tower servers. The purchase of rack mount servers brought us into the 20th century. Client demand for 24x7 access to their financial data dictated that we use a more robust solution than sticking servers in our glorified closet, so we moved production into a co-location center. Growth and the corporate relocation plan now have us creating a new data site in another colocation, this one in NJ, a half an hour from the corporate headquarters.

In this new facility, we're not putting our old, underpowered rack machines. We've ordered blade enclosures, 10 blades each, and a fibre channel SAN to go along with them. We've got 9TB of disks in an AX4-5 to provide storage for the new servers, and once we get the new location completed, I've got to work on the new backup site, which will have a nearly identical footprint.

So yes, I've been there. I'm not currently at the "enterprise" level, but I'm getting closer. I've been all the way at the bottom, and worked my way up. I can, and will, offer advice and caveats, and also some of the problems that I encounter as I improve the infrastructure here.

Feel free to ask anything, critique, and offer advice. We're all just trying to learn more.