Friday, February 27, 2009

February Article Up at Simple Talk: Exchange

My second article is now up and available at Simple Talk: Exchange.

The title is A SysAdmin's Guide to Users, wherein I discuss methods of interpersonal communication and policy writing.

Please check it out, and if you'd rate it, I'd be very appreciative! Also, if you haven't subscribed to the newsletter, you can do that here. There are a lot of very informative articles which come out each month, and you don't get spam from it. At least I haven't yet.

I actually wrote 75% of this article in the middle of the night while on my cruise. I woke up at four in the morning, and for some reason I wasn't tired. I wandered down to the all-night restaurant, got an iced tea, and started writing. This is the result.

Oddly enough, a lot of the first article was written at Grand Central Terminal, in the dining level. So apparently to write, I've got to be in a strange location, or a restaurant? My waistline hopes for the "strange location" option.

Thanks! And please check out the article and let me know what you think.

Tuesday, February 24, 2009

Emergency Server Room Light

For those of you reading this on Tuesday, Feb 24th, has LED touch-n-lights on sale 2 for $7.99 ( + $5 shipping )

Little battery operated lights like this make excellent emergency lights for the computer room. If your electricity is out and you've got to work in there, these things are life savers (maybe literally).

I've actually got the round kind that they sell on TV. Here's a froogle search full of similar things.

Monday, February 23, 2009

Musings on Computer Security

This is another from my LiveJournal, written October 14th, 2006:

While reading my new "Netscreen Firewalls" book for work, I chanced upon the following sentence (paraphrased):

"ScreenOS is more secure than open source operating systems, because its source is unable to be searched for vulnerabilities"

Normally I would ignore such tripe as the rantings of a demented mind, but tonight I reflected on it, and on the general outlook of the security through obscurity camp.

As some of you may know, and others may not care, there are (surprise!) differing opinions among computer security professionals.

One camp, the one to which I most strongly adhere, states that open source code, that is, code which can be read by everyone, is more secure because many eyes are reading the code and searching it for weaknesses. Eventually the bugs get worked out, and the code becomes more secure.

The other camp argues that closed source code is more secure, because Evil Hackers(tm) can't go through the code looking for vulnerabilities to exploit (which they, presumably, wouldn't tell the authors about). This is generally known as security through obscurity.

The perpetual fighting between the two camps has shown no signs of slacking off, and I doubt that it ever will. I suspect that the vastly different ideologies between the "share everything" and the "hide everything" individuals will prevent that from occurring.

Here's what I'll tell you though. There are a couple of things that you won't often hear the open source professionals say...such as that security through obscurity does have its uses, and sometimes it can be handy. Most won't tell you this because they will argue that Evil Hackers(tm) will eventually beat down your security, reverse engineer it, take it apart, and write exploits for it, without your ever knowing it's taking place, and Good Lord when that happens you'll be screwed because you can't even modify the source to the program and fix the hole and you'll probably get fired and have to move to Des Moines and work IT for a pig slaughtering factory to pay for your sins.

Here's the deal. That may, or may not happen. Here's another thing that security professionals won't always admit. Securing a computer system is a gamble. It is gambling in exactly the same manner as insurance is a gamble. Sure, any time you take a measured risk, you must analyze the situation, hedge your bets, and diversify, but in security, how does that apply? It's time to be frank with ourselves here.

A secure government installation will probably have multi-fold security systems. Historically all good security systems involved something you have (such as an ID card, typically) and something you know (such as a passcode). Technology is at the point where we can include "something you are", such as a retinal scan, fingerprint scan, or maybe in a couple of decades, a DNA scan. Someone trying to compromise your account would need to steal your card, find out your PIN, and maybe scoop out an eyeball to get into your system. It's possible, if implausible (and a bit messy), but I'm sure some foreign agents wouldn't hesitate a moment for the right information, and it's that last sentence that seals the deal.

Government secrets are fine, but your shopping list probably doesn't warrant a retinal scanner. My bike used to have a little chain and a Master Lock on it. My bank's vault has a door a foot thick. What I'm getting at is you secure something relative to the severity of its being compromised. A vulnerability in a recipe database might mean a loss of your data, but the exposure of a security issue in my firewall's ScreenOS could mean dire consequences for all of the people who run that particular hardware.

A friend of mine had this as his signature for a while:

"Half of learning any secret is knowing that there is a secret"

Security through obscurity. If we fly under the radar, then the Evil Hackers(tm) won't beat down our door, reverse engineer our software, and write exploits. Probably because they don't really care about our recipes.

Obscurity is a great hiding place for things that people don't care much about finding. Our recipe database, for one. Unless a really bored hacker chances upon it, gets interested, and spends the massive time it would take to reverse engineer it and exploit it, we're probably going to be safe. There would be no return on the time invested.

Obscurity is a really crappy way to hide something that many people are attempting to exploit. Yes, you make it more difficult for the Evil Hackers(tm) because they don't get the source code, but you lose the benefit of having Good Hackers(tm) look at the code too, and fixing it. You are left with your (probably understaffed and overworked) internal development team fixing the holes that they find. Sure, you can license your code out for review, but regardless of who you send it to to have it checked, chances are really good they're not going to be as creative as the community at large, nor as dedicated as a curious hacker bent on understanding your system.

What security through obscurity produces for the interesting target is a loosely tied together community of people cracking away at your security design, testing the perimeters like the raptors in Jurassic Park, and we all remember how that turned out. You might be secure for a while, because you designed your system well, but if there is a vulnerability, and trust me, there most likely is, then once it is found (and it will be), you will be at the mercy of the people who have uncovered the key.

In conclusion (and to review), security through obscurity is acceptable for targets which nearly no one has any interest in. It is not acceptable for interesting targets. If at all possible, you should secure your targets with community produced, time tested solutions, configured correctly and with the proper amount of paranoia, and if at all possible, heap some obscurity on top of that, and hope no one notices you.

Saturday, February 21, 2009

How to ask questions on the internet

This was originally written on October 23rd, 2007 and posted in my LiveJournal. It ends short, but the idea isn't too bad, and the content isn't yet out of date.

As the internet has become ubiquitous over the past decade, it has become our primary research tool. Seemingly endless in the knowledge it contains, we use an array of tools to mine for data, most of them from Google.

Sometimes, our search ends up short. The infinite monkeys on infinite typewriters have yet to produce the particular work we need, and it's up to us to prod them into movement. We must ask a question to the faceless, and hope someone succumbs to our petitioning.

I'm here to help you figure out how to do that.

You'd think it would be easy. You'd think you could just ask what you wanted to know. Folly.

At some point during your research, you must have come across two of the most common instances where someone asked a question and received absolutely no help at all. Allow me to illustrate:

Our goal: Why the foo widget doesn't appear to work when the bar widget is installed

Obvious (i.e. wrong) question:
I've got foo widget, but it doesn't work when bar widget is installed. Any idea why?

While to the non-reptilian brain, this might appear to be a perfectly valid question, if you submitted this to a forum, you would most likely be greeted with derision and mocking. Why? You didn't give enough information.

Suppose you asked the question and received several such replies. You might be tempted to say "screw you guys", find another forum, and ask the same question with more detail. Not a bad idea, but you definitely need to be careful. Here's why:

More detailed (i.e. wrong) question:
All: I've got foo widget version x.y.z running on my slackbuntuhat 7 machine. When I install bar widget 37.9 I am getting the following errors occurring in syslog.
- cut 30 lines of text -
I've read through the sourcecode and at the point it generates this error, the $arrBaz looks like it might be overflowing, but I can't tell if it's a bug or a clever programmer. I modified the source, and fixed the original error, but now I am getting this output:
- cut 40 lines of text -
Anyone else had this problem?

The response:

*crickets chirping*

Alright, you have corrected the initial flaw, a dearth of information. In its place, you have inserted more information than anyone other than the actual developers of the software are likely to know anything about. Unless you're on a listserv where the authors are frequent posters, you're out of luck.

The proper way is to tease support out of the other commenters. You must coax information out of them a bit at a time, just whetting their curiosity. Maybe start like this:

All, I'm having issues with foo widget x.y interacting with bar widget 37. foo works fine, but when I install bar, it goes belly up. Anyone seen this happen?

After that, you'll probably get a reply asking for what platform it's on, log output, etc etc. Include the information they request, and always end the post with a request for more help. That way, anyone casually browsing by who has the knowledge will see your post being the last, with no reply, and they might respond. If you overwhelm them with information in any one post, you'll get dropped like a hot potato.

Thanks for reading, and good luck.

Friday, February 20, 2009

Undelete any open files in Unix/Linux

Here's a great post by Chris Dew about how to undelete any file that is still open.

There are some unmentioned prerequisites, like a "modern" kernel, but otherwise this should work.

As Chris explains, the underlying magic is that when a program holds a file open and that file gets deleted, the data doesn't get deleted until the program closes it. Read the post for the details. Really interesting stuff!

Incidentally, this is the same reason that 'du' and 'df' disagree sometimes. Giant log files which get deleted but don't clear up disk space, for instance. That's because whatever process wrote them still has the file open. This can be really confusing sometimes!
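To make the idea concrete, here is a minimal Python sketch of the general technique on Linux, not a copy of Chris's method. It assumes a kernel that exposes open descriptors under /proc/<pid>/fd and marks removed files with a " (deleted)" suffix. The function names and the demo file name are made up for illustration.

```python
import os
import tempfile


def deleted_open_files(pid="self"):
    """Return {fd: original_path} for deleted files a process still holds open."""
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were looking
        if target.endswith(" (deleted)"):
            found[int(name)] = target[: -len(" (deleted)")]
    return found


def recover(fd, dest, pid="self"):
    """Copy the still-open data out through /proc/<pid>/fd/<fd> into dest."""
    with open(f"/proc/{pid}/fd/{fd}", "rb") as src, open(dest, "wb") as out:
        out.write(src.read())


def demo():
    """Delete a file we still hold open, then get its contents back."""
    workdir = tempfile.mkdtemp()
    victim = os.path.join(workdir, "important.log")  # hypothetical file name
    handle = open(victim, "w+b")
    handle.write(b"do not lose me\n")
    handle.flush()
    os.unlink(victim)  # the name is gone, the data is not
    rescued = os.path.join(workdir, "rescued.log")
    recover(handle.fileno(), rescued)
    handle.close()
    with open(rescued, "rb") as f:
        return f.read()
```

Running deleted_open_files() against the PID of a daemon (as root) is also a quick way to find the deleted log file that is eating the space 'df' reports but 'du' can't see.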

Step back from the ledge, careercast says it's all ok

Apparently, Computer System Analysts have the third least stressful job in America.

According to the description on their site, Computer System Analysts "plan and develop computer systems for businesses and scientific institutions". I'm sure in whatever make believe world they live in, there's a magic wand that you can wave and infrastructures grow like corn. Around here things work a little differently.

Also, if you look farther down the list, Software Engineer is #8.

Great job, guys.

Thursday, February 19, 2009

Bizarre issues almost always point to DNS problems

Duct tape. The Force. DNS.

These are the things that bind our world together. Sure, you can't see the force when you're juggling rocks while standing on your head, just like you don't pay attention to DNS 99% of the time you're browsing the web, but that doesn't mean it doesn't affect everything you do.

Misconfigured DNS has caused more, weirder problems than any other single aspect of networking I've yet encountered. Sure, it causes plain, vanilla connectivity issues when you can't resolve something, but it gets much weirder.

Misconfigured DNS causes mail to break, Active Directory to stop authenticating (or to even recognize that domains exist), SSH sessions to time out instead of connect, and an entire host of other problems.

I have even had it cause password issues: the DNS server I was using pointed me to a different machine, one configured identically and with all the same identifiers, so when someone added my account to the machine she was talking to, I couldn't get access. We fought with this for a few hours before I got desperate enough to check into the IP addresses we were connecting to.

This is just a friendly reminder that DNS is everywhere, and if you're having a bizarre network issue, make sure DNS is somewhere early in your troubleshooting checklist.
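As a rough sketch of what "check DNS early" can look like in practice, the snippet below compares what a hostname resolves to against the address you believe it should have. It uses Python's standard socket.getaddrinfo. The function names and the resolver parameter (there so the check can be exercised without a network) are my own, for illustration only.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses a hostname resolves to."""
    infos = resolver(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def check_dns(hostname, expected, resolver=socket.getaddrinfo):
    """Compare what DNS says against the address you believe the host has."""
    try:
        actual = resolve(hostname, resolver)
    except socket.gaierror as err:
        return f"{hostname}: lookup failed ({err})"
    if expected in actual:
        return f"{hostname}: OK ({expected})"
    return f"{hostname}: MISMATCH, expected {expected}, got {', '.join(actual)}"
```

Running the same check from both machines involved would have exposed the "same name, different box" problem above in a few seconds instead of a few hours.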

Wednesday, February 18, 2009

Ideas on maintaining a customized shell environment

Nick Anderson posted an entry today about using a script to customize your home directory.

This is a cool idea. Legooolas took it to the logical conclusion in the comments by suggesting revision control on a centralized server to check in/out your directory.

I like it. I don't think I could really use it, though. My home directory tends to get pretty trashed up. All of the packages I download to play with get thrown there, I don't take the time to clean up all of the things I install from source, and really, it's pretty big at the moment. Somewhere like 10GB if I run du from /home.

The answer we use for our users in the enterprise environment is to have NFS exported home directories. We have a separate fileshare on the fileservers which is mapped to /usr2. It's exported to all of the machines into which users can log in. This works nicely since we're on a secured Gb network. Exporting your home directory over NFS on the internet would be...ill advised. CIFS might work a little better, but most people don't have that sort of need across the internet. For the people that do, I'm guessing that Subversion would be their best bet, assuming they don't have tons and tons of files to check in/out.

New resource for IT Administrators is being built

I just got some exciting news yesterday. My editor, Michael Francis, just let me know about a project that he has apparently been spending a lot of time on.

The Sysadmin Network is a new social networking site designed specifically for systems administrators and the like to discuss not just technical subjects, but to really concentrate on the social aspect of being an admin.

It might sound hokey, but this is an aspect to IT administration that hasn't been touched on nearly enough. People out there feel pressure. They take abuse. Nearly everyone feels unappreciated and taken for granted, and because of the nature of our positions and personalities, we go through it in relative silence. I touched on this in my burnout post, and there aren't many days that go by that someone doesn't hit that blog entry from a google search about being tired of doing their job.

Of course, that is not to say that we are all just martyrs for our cause. There is a lot of joy in being a sysadmin, too. The feeling of accomplishment as you bring order from chaos. When you build something new and it works, and people enjoy using it, that's a tremendous feeling, and very important. I also think we all get that sort of exhilaration from learning new things, too. Learning, implementing, relearning, and reimplementing is so much of our job, and so rewarding that it probably releases endorphins in our brains like eating hot chili peppers.

I am hoping that this will be an excellent site for sysadmins to discuss topics relevant to our lives. I've already signed up to join, and I think you should too.

Sharing information is what this blog is about, and it speaks to me that the sub-title of The Sysadmin Network is "No more hiding in the server room". Band together for mutual improvement and you will grow more than you thought possible.

Tuesday, February 17, 2009

Things in the lineup

I've been working on a lot of posts for the future, and none of them are really ready yet. Until they come out, I just wanted to let you know what was in the wings that I'm working on.

  • Introduction to Logging

  • Results of the Job (dis)Satisfaction Survey

  • HOWTO: Become a Sysadmin

  • HOWTO: RedHat Cluster Suite

  • Guide to logical network diagrams

If you'd rather see one sooner than the others, let me know in the comments. They're all in an early stage, but the ideas are materialized.

Thanks for your patience!

Fine, fine, I'll do it. I'll microblog too

but I won't like it.

Actually, I think I've finally settled on using twitter to publish all of those weird little links and things that don't really merit an entire blog entry.

Because I don't want to untie those useful little tidbits from the blog itself, I've added an RSS reader to the lower right hand side of the page which will contain my twitter feed. This way if you don't use twitter, you can still take advantage of the links.

I have not carved this in stone, of course, so I'd like to hear some feedback on what you think of this. Do you like even the little blog updates with the links, or does this arrangement improve the signal to noise ratio for you?

Just so you know, feel free to follow my tweets: standaloneSA (since the username can't contain "admin")

Monday, February 16, 2009

Stretching That Almighty Dollar

or "Why is everyone else's dollar more almighty than mine?"

In the immortal words of Yogi Berra, 'The future ain't what it used to be'.

I think, as a species, we're programmed to look at the past with wonder, and the future with hope, but during times of extreme economic turmoil, it's hard to imagine that flying cars and moon bases are around the corner. Right now, we're suffering through a pretty bumpy spot, and it will get better, but at the moment, resources are scarce and confidence is low. Despite the scarce resources, demand hasn't dropped. Anything but, really.

Since we're expected to do more with less, we've got to change our tactics a bit. You need three new machines, but can only get one. Or none. It's time to renew software contracts, but there's no budget. AntiVirus isn't optional on desktop machines, but yours hasn't had an update since Thanksgiving. What do you do?

At times of economic turmoil, the inclination toward software piracy goes up (particularly for small businesses which don't have the resources and visibility of larger companies), so let me please urge you to not follow that route. We are blessed, in a way, to exist in a time where so much quality software can be had for free through legal means. It wasn't always the case, so take advantage of it.

Let's go over some ways that you can spend less (and sometimes no) money and still achieve your goals. Since we're in survival mode, we're going to have to compromise in a few spots. You might say "Free antiviruses won't work as well", but if you can't afford the commercial solutions, free may be all that is left.

We'll break it down by categories: Hardware, Software, and Bandwidth. There are other, non-IT ways to cut down expenses, but we'll limit ourselves since this is an IT-themed blog. I've also included licenses in the software category due to the disturbing trend of companies charging a recurring fee to continue to use the software that you already paid for. It doesn't seem right to me in many cases, but we've got to confront reality as it is, not as we wish it were.

I do have to warn you. Sometimes, you will need to spend money. Hopefully not a lot, but if your tape drive breaks and you can't fix it, and you only have one, buying a tape drive is cheaper than not doing backups. Almost every time. This means that you're going to have to talk to the person who holds the purse strings. When you do, make sure that they understand that the value of the data is the deciding factor, not just the tape drive. Ask them what would happen to the business if the data were to be unrecoverable. That is the risk. Be assertive; it's not just the company, it's your reputation, as well.

Now, when you are required to buy hardware, you don't have to buy retail. In fact, if you're strapped for cash, it's a really bad idea. Refurbished hardware, Ebay, Craigslist, these are possible sources of hardware. If you can't find it from those places and you have to pay retail, at the very least, shop around. Froogle, Pricewatch, maybe even Slickdeals (If you like slickdeals, I just heard about from my friend AJ. You might like it as well).

So find the cheapest source of hardware available, but at the same time, do your best to ensure that you aren't getting ripped off. Check feedback, reviews, and the like. You don't want to be responsible for losing what little money you got from the company.

  • Hardware

  • Nothing is more scary than running critical services on a piece of hardware for which you don't have spare parts. I've done it, I know. If you've got a few machines that are in use, and they're identical, having only one matching spare isn't ideal, but it's better than nothing. Supposing that you have one absolutely critical server.
    Obviously, it would be best to have an identical machine to act as backup, but that's not always possible, so we've got to make do. Identify what it is about this machine that is unique. Is it storage? The number of network cards? Maybe another type of card, such as in a phone server, or maybe the tape drive. Identify that part, and locate the cheapest working spare you can for it. Use the above sources. Ask around.

    If it's a piece of networking equipment, you may be in luck. There are many, many dealers of refurbished network hardware, and the prices are a fraction of retail. An example: if you needed an endpoint for a T1, Cisco would try to sell you a 2801 (or above) for around $1000. Or you could go to routermall and pick up a 2610 and a T1 WIC for around $260, and that's not even the best price out there. Call the dealers and explain your situation. They want your business, and they will negotiate.

  • Software

  • The software that you will be acquiring has a lot to do with the software that you already have. Interoperability is key, so hopefully you have been using suites which utilize open standards. That certainly increases the likelihood of cheap/free software working with it. There are limits, of course. It's going to be another couple years before Samba can function as a stable active directory controller, so for now, if you've got an AD infrastructure, you'll need to stick with that. If you need to implement centralized authentication from scratch, though, there are more options.

    Categorizing all of the available free/open source/cheap software is beyond the scope of this article, but do not be afraid to get your feet wet. Try new software. Test it first, of course, but see what is out there. Here are some resources to help:

    Before you shell out hundreds of dollars for software, go through these sites and see what is available.

    There are certain commercial software suites out there which require you to pay an annual fee for the ability to continue to use their software. AntiVirus products are the most visible, and you could argue in their favor in this case. A lot of work goes into producing antivirus updates. They provide a valuable service. But there are free solutions out there. Given a choice between paying several hundred dollars (or thousands, depending on your infrastructure) that you don't have, or installing a freely available equivalent, I would choose the free software every time. It might not be your preferred solution, but it's better than not having an AntiVirus suite. The same goes for other software that charges annual fees and has freely available alternatives. Again, check the resources above to see what is available.

  • Bandwidth

  • Bandwidth is sort of a sticky situation. You sign contracts on dedicated circuits (T1s and above), and the shorter the contract (i.e., the more flexible for you), the more expensive it is per month. One possible solution is to not go with dedicated circuits. DSL and cable are both faster (download, anyway) than T1, even at the lowest bandwidth selection. The benefit of a T1 is that the bandwidth is dedicated to you, and you get the service level agreements (along with decent quality support in the event of outages).

    If you do decide to use consumer broadband rather than dedicated circuits, make sure that you are allowed to run whatever services over the network that you need to. Some people host their own web or FTP servers, and some providers block those services. Make sure you're not going to have a loss of availability by moving to another provider.

    One thing I have never tried, but I don't see why it wouldn't work, would be to share the cost of the bandwidth with your neighbors. Yes, it's the very same technique used by college kids everywhere to steal cable, but without looking at the contracts, I don't see why a T1 wouldn't allow this. Please drop some feedback if you've done this, or know someone who has, and how it worked out. I'm curious about the legalities. If you do do this, make sure to use firewalls between your network and your neighbor's.

Hopefully this entry either helped you see places where you could save money, or inspired you to look for your own. If you've got any techniques or hints on ways to save money for others, please put them in the comments. I know that a lot of people are going through tough times and can use the help.

Good luck, and thank you.

Sunday, February 15, 2009

CIO Management Methods

I found an interesting link on Reddit regarding IT project management. It's directed towards large scale infrastructures, but I think most people could take something from it, even if that's just another perspective.

PPM: Deciding Where to Begin

Saturday, February 14, 2009

Link: Introduction to Xen Virtualization

I've played around with VMware for years. Not on servers, but on my desktop. I was really familiar with the way it worked when I found VirtualBox, which became my desktop emulator of choice. They both worked in very similar ways, where they would completely emulate an entire machine. Then Xen came out, and I was pretty lost.

I've never taken the time to do the hard work of acquiring a mastery of Xen. I didn't really see the need for it, and I didn't understand it well. I still don't understand it all, but yesterday Joe Topjian posted a link on Adminspotting to an outstanding Introduction to Xen on Debian. A very small bit of the instructions is Debian specific, but it functions well as a general introduction to Xen. There are some questions unanswered, but it seems like a great starting point.

If you are similarly unfamiliar but curious, definitely check it out.

Friday, February 13, 2009

In case you missed it earlier...

Happy .... uhh...consecutively incremented

Spanning tree and dhcp: Knowledge doesn't always imply action

I went through three semesters of the Cisco CCNA training around 6 years ago. I keep most of that knowledge in the part of my brain that I use frequently, so I don't lose it often. I do sometimes get my Cisco CLI mixed up with my Juniper ScreenOS CLI, but I do the same with Linux and Windows from time to time, and anyway, I'm digressing.

Since I use it relatively frequently, you would think that I would apply this knowledge to my troubleshooting endeavors, but no.

I was reading this thread on Tech Republic about switches, and by chance, someone mentioned using portfast to stop problems with DHCP clients.

The lightbulb above my head went on, and it finally connected. For the past few months, I'd had reports of intermittent connection problems in one of our offices. Users would have to manually release and renew their leases in order to get on the network. I had checked the DHCP settings and leases, and everything looked normal. It was just like Windows wouldn't ask for a lease. I *know* that spanning tree will stop DHCP leases. For some reason, I just didn't put the two things together.

Funny how our brain works (or doesn't) sometimes, isn't it?

By the way, if you're interested in learning more about how and why spanning tree blocks DHCP sometimes, check this page. The short of it is that when a port on a switch is first made active (you plug a cable in, or turn the computer on), the switch checks to see if that port is part of a loop, and it refuses to forward traffic while it's checking. Conveniently this is the same time that the computer is asking for a DHCP lease. No traffic + no DHCP response = no lease. This can be solved by enabling portfast on that port, which tells the switch to start forwarding traffic immediately instead of waiting for the check to finish.
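The "no traffic + no DHCP response = no lease" arithmetic can be sketched as a tiny timing model. Treat every number as an assumption: classic spanning tree with default timers spends 15 seconds listening and 15 seconds learning before it forwards, but the retry schedule and the client's give-up time below are hypothetical values picked for illustration, not measurements from Windows or from our switches.

```python
def first_successful_attempt(blocked_seconds, attempt_times, give_up_after):
    """Return the time of the first DHCP attempt that gets through, or None.

    blocked_seconds: how long the switch port drops traffic after link-up.
    attempt_times:   seconds after link-up at which the client sends a request.
    give_up_after:   seconds after which the client stops trying.
    """
    for t in attempt_times:
        if t > give_up_after:
            break  # the client has already given up
        if t >= blocked_seconds:
            return t  # the port is forwarding, so this request reaches the server
    return None


# Assumed numbers, for illustration only.
STP_DELAY = 30                  # classic spanning tree: 15 s listening + 15 s learning
PORTFAST_DELAY = 0              # portfast: the port forwards straight away
ATTEMPTS = [0, 4, 12, 28, 60]   # hypothetical retry schedule with backoff
GIVE_UP = 45                    # hypothetical client timeout

without_portfast = first_successful_attempt(STP_DELAY, ATTEMPTS, GIVE_UP)
with_portfast = first_successful_attempt(PORTFAST_DELAY, ATTEMPTS, GIVE_UP)
```

With these made-up numbers, every request sent before the client gives up lands while the port is still blocked, so without_portfast is None (no lease), while with_portfast succeeds on the very first try.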

Running to keep up

Funny how accomplishing things leads to even more on your plate.

Sometimes administration feels like fighting a hydra. You've got a task in front of you, and as soon as you make any headway, four more tasks come as a result. I'm definitely in that mode right now.

I started this week with around 25 items on my 'to do' list. I've accomplished several, and at last count I was at 37. Maybe I'm walking the wrong way on the escalator? In any event, I'm very, very glad that I read Time Management for System Administrators. Heck, without that, I wouldn't even know that I had 37 things to do. Deciding whether ignorance is bliss is left as an exercise to the reader ;-)

Tuesday, February 10, 2009

Oracle Architecture Basics

decipherinfosys posted a link to a really informative page that has an introduction to Oracle architecture. If you're familiar with other databases and have considered checking Oracle out, this might be right up your alley.

If you've been holding off on playing with it because you've heard that it's really expensive, well, yes, it is expensive, but the license states that you can download it and play with it for free, as long as it's "only for the purpose of developing, testing, prototyping and demonstrating your application". So yes, you can get experience playing with Oracle for free, you just can't put it into production.

So read, download, play, and enjoy!

Boy are my arms tired!

Well, I'm back in the States, and working on getting back into the groove. I want to take a second and reiterate my thanks to the many guest bloggers who helped me out and contributed some really great information while I was gone. I really enjoyed learning from each one of the entries, and I can tell from your comments that they were well received. So thank you again, Ian, Bob, Nick, Ryan, Jeff, Phil, and Michael. You guys helped me out a lot, and I really appreciate it.

After Amy and I got back to Columbus, things got even crazier. Due to an (un?)fortunate alignment in the cosmos, we were scheduled to have the moving company come pack up all of our belongings the morning after we got home at 11:30pm from the trip. So I slept about three hours that night, while Amy pulled an all-nighter to get things ready. I don't know how she did it. We got it done, though, and spent Saturday night with our friends Mike and Heather, since we didn't have a bed at that point.

Sunday morning we got up and packed the things we needed into the car and left for New Jersey, and we're there now. Our furniture should be here this morning sometime.

I started in the corporate office yesterday, and it seems like it's going to be a lot of fun. My "todo" list is sitting around a dozen, and that's just because I left it at work. I've come up with several other tasks since then. Since there's no shortage of things for me to do, I'd better get started.

Thanks for reading, and I hope to return to normal blog posts tomorrow. By the way, I'll be uploading photos of the trip to my flickr page, if you're interested.


Friday, February 6, 2009

Backups Suck

Many thanks to Michael Janke for this blog entry

Years ago we had a period of time where we had nothing but problems with backups. Tape drives failed, changers failed, jobs failed. We rarely if ever went a weekend without getting called in to tweak, repair or restart backup hardware or software. It was a pain. One of the many times that the hardware vendor was on site replacing drives or changer parts, I asked the tech:

"Does everyone hate backups as much as I do?"

The answer:


So backups suck, but like it or not they are an essential, and perhaps the most essential, part of system administration. If you can't recover from failure, then you are not a system manager or system administrator. Change your title. The one you've been using isn't appropriate.

Here are my thoughts on backups.

Key concepts:

RPO and RTO. If you don't know what they mean, start googling. If you know what they mean, then you should also know what they are for each of your applications. If you have no formal SLAs covering recovery, you should at least have informal agreements between you and your managers and your customers as to what expectations are for recovery points and recovery times for storage, server and site failures. If you don't know what the expected RPO and RTO are for your applications, you've got a problem. You can't really make backup and recovery decisions without at least some idea of what they might be. At the very least, make up an RPO and RTO and let your boss know what they are. A little CYA doesn't hurt.
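To make the two objectives concrete, here is a minimal sketch in Python. The function names and the example figures are made up for illustration, and it assumes the simplest possible model: worst-case data loss equals the interval between backups, and recovery time is whatever you estimate a full restore takes.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """Assumption: a failure just before the next backup loses
    everything written since the previous one."""
    return backup_interval


def meets_objectives(backup_interval, estimated_restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for one application."""
    rpo_ok = worst_case_data_loss(backup_interval) <= rpo
    rto_ok = estimated_restore_time <= rto
    return rpo_ok, rto_ok


# Hypothetical application: nightly backups, a 6 hour restore,
# and an agreed RPO of 4 hours and RTO of 8 hours.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(hours=6),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)  # (False, True)
```

In this made-up case the restore is fast enough, but a nightly backup cannot satisfy a 4 hour recovery point, which is exactly the kind of gap worth discovering before the failure.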

Backup versus Archive. You can Google that phrase ('backup versus archive') and get good definitions. The way I define them, a backup exists to permit recovery from system failure to the most recent recoverable point in time in a manner that meets recovery point and recovery time objectives. An archive exists to permit recovery to points in time other than the most recent recoverable point in time. By those definitions, any backup older than the most recent backup is an archive. In general, backups protect you from physical failures. Archives protect you from logical failures.

You have a valid backup when….

1. The backup is on separate spindles and controllers from the source data.
2. The backup is off site.
3. The backup is tested by successfully restoring data.

If it isn't on separate controllers and spindles then it's not a backup. It might be a copy of the data that protects you against certain failure modes, but it's not a backup. RAID 1, 5, 6, 10, 0+1, 1+0, or whatever are not substitutes for backups. Controllers fail. I've personally experienced a handful of controller failures that resulted in scrambled data. The failed controller will scramble both halves of the mirror and all of the RAID set. So a database dump to a LUN on the same SAN as the database isn't a backup until it is swept off to tape or copied to a disk pool on some other controller/spindles.

If it isn't off site, it's not a backup. If you have a stack of tapes that will get sent off to the data vault when the company that does that shows up at 10am Monday, those tapes will be a valid backup at 10:05am Monday. Until then, they are a copy of your data, not a valid backup.

If you haven't tested it, it's not a backup. Think about that. One of the things I've done to drive home the importance of backups is to walk up to a sysadmin's cube and ask them to delete their home directory. I'm the boss, I can do that. Trust me, it's fun. :-) If they hesitate, I know right away that they don't have confidence in their backups. That's bad – for them, for me, and for our customers.
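The three tests above can be written down as a small checklist. This is only a sketch; the class and field names are hypothetical, and the two example copies are modeled on the database dump and the vaulted tapes mentioned above.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    separate_hardware: bool  # different spindles and controllers than the source
    off_site: bool           # physically away from the source site
    restore_tested: bool     # data has actually been restored from it


def why_not_a_backup(copy):
    """Return the tests a copy fails; an empty list means a valid backup."""
    reasons = []
    if not copy.separate_hardware:
        reasons.append("shares spindles/controllers with the source")
    if not copy.off_site:
        reasons.append("not off site yet")
    if not copy.restore_tested:
        reasons.append("never restored from")
    return reasons


# A database dump sitting on a LUN on the same SAN as the database.
dump_on_same_san = BackupCopy(separate_hardware=False, off_site=False, restore_tested=True)
# Tapes that have reached the vault and have been restored from before.
vaulted_tapes = BackupCopy(separate_hardware=True, off_site=True, restore_tested=True)

print(why_not_a_backup(dump_on_same_san))
print(why_not_a_backup(vaulted_tapes))  # []
```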

A backup is not an archive, and an archive is not a backup. In my world, an archive permits recovery to points in time other than the most recent recoverable point in time. Perhaps because of a regulatory requirement, you need to be able to recover files or databases as they were a month, a year or a decade ago. Then you need archives. If you don't have regulatory or other retention requirements, an archive still protects you against 'logical failures'. For example an archive provides protection against file deletion or corruption that went undetected for a period of time, or protection against accidental or intentional deletion or destruction of data.

But you likely can design a system where backups and archives use the same hardware and software, and in many cases, a backup can become an archive. In the common Grandfather-Father-Son (GFS) tape rotation, the full backups become archives as soon as the next full backup is finished. At that point in time, the full backup is no longer protecting you against server, storage or site failure. It's too old for that. But it is still protecting you against logical failure (a file or database that got corrupted or deleted, but went undetected for a period of time.)
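As a rough illustration of that hand-off, the sketch below labels a list of completed full backups: the newest one is the backup, everything older is an archive. The function name and the weekly dates are assumptions for the example, and it ignores incrementals entirely.

```python
from datetime import date


def classify_fulls(full_backup_dates):
    """Label the newest completed full as the backup, older ones as archives."""
    ordered = sorted(full_backup_dates)
    if not ordered:
        return {}
    labels = {d: "archive" for d in ordered[:-1]}
    labels[ordered[-1]] = "backup"
    return labels


# Hypothetical weekly fulls.
fulls = [date(2009, 1, 18), date(2009, 1, 25), date(2009, 2, 1)]
labels = classify_fulls(fulls)
for day in sorted(labels):
    print(day, labels[day])
```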

Snapshots, Replication and Log Shipping.

Vendors are more than happy to sell us tools and toys that solve all our problems, but do they really? When do snapshots and various replication strategies protect us against physical and logical failures?

It depends.

We replicate some data (actually 25 million files) to an off site location using an OEM'd version of Doubletake. The target of the replication is a fully configured cluster on separate SAN controllers miles away from the source. That copy of the data protects us against site, storage and server failure (it's our backup). But when a customer hits the 'press here to delete a half-million files' button that the software vendor so graciously provided (logical failure), the deletes get replicated in a couple seconds. The off site replica doesn't help. Those files are recovered from an archive (last night's incremental + last weekend's full), not from the backup (the real time replica).

Another example is the classic case where a user or DBA deletes a large number of rows from a table or does the old 'DROP TABLE' trick. If you've configured log shipping or some other database replication tool to protect yourself against site, server or storage failure, you'll replicate the logical failure (the deletion or drop) faster than you can blink, and your replicas will also be toasted. The replication technology will replicate the good, the bad and the ugly. It doesn't know the difference. You need to be able to perform a point-in-time recovery to a point before the damage was done, and replication alone doesn't provide that. Transaction logs, archive logs and similar technologies provide the point in time recovery.
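Here is a toy model of that difference. It is not how any particular database implements log shipping; the log format and function names are invented for the sketch. The point is only that a replica applies every log entry, while a point-in-time recovery replays the same log and stops before the damage.

```python
def apply_op(state, op):
    """Apply one logged operation to a table held as a dict of key -> value."""
    kind, key, value = op
    if kind == "put":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)
    elif kind == "drop":
        state.clear()


def replay(log, up_to):
    """Rebuild state from the first `up_to` log entries."""
    state = {}
    for op in log[:up_to]:
        apply_op(state, op)
    return state


# Transaction log on the primary: two inserts, then the accidental drop.
log = [
    ("put", "row1", "alice"),
    ("put", "row2", "bob"),
    ("drop", None, None),
]

# A replica faithfully applies every entry, including the mistake.
replica = replay(log, up_to=len(log))
# Point-in-time recovery stops just before the damaging entry.
recovered = replay(log, up_to=2)

print(replica)    # {}
print(recovered)  # {'row1': 'alice', 'row2': 'bob'}
```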

Snapshots tend to complement replication. In general, a snapshot of a disk that is stored on the same controllers protects you against logical failure (it's an archive), but not against site, server or storage failure (it's not a backup). The snap gives you a point in time that is recoverable against logical failure, but not physical failure.

Whatever you have for a backup and archive system, keep in mind

* physical and logical failure
* recovery point and recovery time

And make sure you understand how you will recover from the failure modes within the recovery time to the recovery point.
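One way to keep both failure classes in view is a simple coverage table. The sketch below encodes the examples from this post (the off site replica, the same-controller snapshot, the older full plus incrementals); the dictionary and function names are hypothetical, and a real environment would have its own entries.

```python
PHYSICAL, LOGICAL = "physical", "logical"

# What each mechanism protects against, following the examples in the text.
COVERAGE = {
    "off-site real-time replica": {PHYSICAL},
    "snapshot on the same controllers": {LOGICAL},
    "older full plus incrementals": {LOGICAL},
}


def uncovered(mechanisms):
    """Return the failure classes that none of the given mechanisms cover."""
    covered = set()
    for name in mechanisms:
        covered |= COVERAGE[name]
    return {PHYSICAL, LOGICAL} - covered


print(uncovered(["off-site real-time replica"]))  # {'logical'}
print(uncovered(["off-site real-time replica", "snapshot on the same controllers"]))  # set()
```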

Then – because I teach at a local college, I get to give you all an assignment. It's got two parts:

1. Delete your home directory
2. Recover it from backup

Let me know how you did.

Michael Janke Last In, First Out

From whence I came

Today, sadly, I'll be returning to Barcelona to catch my flight home, and the real world.

Because my life isn't ever boring, I'm moving to Union County, New Jersey this weekend. The movers are coming to take our stuff away tomorrow, and Sunday we'll be driving to NJ to move into our new apartment.

Monday I'll be starting in the corporate headquarters. I've visited several times for a week or so, but this will still be a sizable change from what I'm used to. The wardrobe requirements in particular will be difficult to adjust to. I won't be able to wear sandals and jean shorts in the summer. The horror!

Anyway, I'll cover all of that in the future. Thanks to the time warp of scheduling these entries early, it's now 1am Saturday, January 24th, and in a few short hours, I'll be on a plane bound for the Mediterranean. Have a good weekend!

Wednesday, February 4, 2009

Software patching is the other benefit of virtualization

Many thanks to Philip Sellers for this blog entry!

Our sys admin group seems to be constantly grappling with software patches. We feel constantly behind and reactive to new patches and firmware that are released, and it's a never ending cycle. Since I joined the company a little over 2 and a half years ago, I've been asked to write a patch plan for our Windows servers twice, maybe three times. Unfortunately, we have never been able to make these patches happen consistently. We'll make a big push to patch and that seems to break lots of things, which forces us to stop again and fall further behind. Lately, we don't feel we have a choice but to apply the patches for some of the recent security holes that Microsoft has plugged. So we're faced with what seems like a catch-22, and so we have pulled the trigger and bitten the bullet.

Our last patch push was handled very well, with virtually no problems arising from the patches applied, except with our Citrix farm. Citrix Presentation Servers didn't like one or two of the patches - can't really tell you what happened there, but I know we reverted to pre-patched disks because of problems. The good is that we are finally (mostly) up to date. The bad - the last push was handled almost 100% manually, which anyone in our field will tell you is NOT the way to patch. It's too time consuming, monotonous and wasteful.

What is different today and what has allowed us to realistically look at automated patching today is our virtualization using VMware. Since the last patch plan I drew up, we've virtualized much of our datacenter. Also, most of our newer sprawl has been contained in virtual servers. Our datacenter today is about 80% virtualized to about 20% physical for Windows servers. We began investigating VMware's Update Manager product several months ago and we've been really impressed with the results.

Every good patch plan has a few basics that have to be included, in my opinion. First, you have to know what patches need to be applied - so you need to connect to a patch repository. There are third party software solutions that do a great job of this for a broad group of software products. VMware's Update Manager uses Shavlik to provide much of its update database. The second thing the plan should include is fail-back and recovery. There are times when patches just don't provide the expected results and being able to revert is always critical. Third, you should be able to control the time updates are applied and minimize the amount of sys admin interaction required. For Windows, that can be accomplished via group policy and Active Directory structure or using third party software like Update Manager. Fourth, you have to make room for exceptions. Every network has these, whether it's the mission critical server that can't afford downtime or it's the self-important system with dictated uptime due to political reasons.

Since we have caught up to current patches on our systems, we've drafted a new patch plan in hopes of keeping up and never getting behind like that again. We settled upon using Microsoft's WSUS (Windows Server Update Services) and VMware's Update Manager as our two pronged solution. These two products hit our two major categories of Windows servers - WSUS for physical and Update Manager for virtual servers. Both products allow for the approval of updates and reporting against the baseline of approved updates to see which systems require patching. From there, you can begin the remediation process to bring these systems up to the baseline.
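The baseline idea itself is simple enough to sketch. The Python below does not use the WSUS or Update Manager APIs at all; the function name, patch IDs and server names are invented, and it only shows the comparison of an approved baseline against what each server reports as installed.

```python
def missing_patches(approved_baseline, installed_by_server):
    """Map each non-compliant server to the approved patches it lacks."""
    report = {}
    for server, installed in installed_by_server.items():
        missing = approved_baseline - installed
        if missing:
            report[server] = sorted(missing)
    return report


# Hypothetical patch IDs and server names.
baseline = {"KB0001", "KB0002", "KB0003"}
installed = {
    "web01": {"KB0001", "KB0002", "KB0003"},
    "citrix01": {"KB0001"},
}
report = missing_patches(baseline, installed)
print(report)  # {'citrix01': ['KB0002', 'KB0003']}
```

Servers that show up in the report are the ones that need remediation; anything not listed already matches the baseline.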

Update Manager also brings the inherent benefits of virtualization to the table where patching is concerned. The Update Manager workflow and scheduler include rollback snapshots, with automatic removal as part of the workflow. This is a big capability, as we all know that sometimes patches cause problems or even fail to install. The scheduling features are robust and allow for a fully customized rollout schedule while the administrator just sits back and watches the rollout occur. As with any automation, there is a small risk of missing something during the install, but so far, our experience is that the software reports back any problems so that you can give them attention individually. It's also a great solution for our DMZ since the updates are mounted as virtual CDs and installed from there. It addresses the problem of patches filling up a server because they are downloaded, executed, but never cleaned up using Automatic Update. All in all, we feel like we've found a winner.

Mediterranean Island: Malta

Today I'm hitting the small island nation of Malta. To be honest, I don't know much about Malta, other than what the wiki article says. I'm passingly familiar with the Knights Hospitaller, but other than that, I guess I'll be learning when I'm there.

From everything I've read about visiting, it's one of the few places that is actually small enough to take in on a single day. It should be interesting, and if not, at the least it'll be a relaxing day in a beautiful part of the world.

Monday, February 2, 2009

(Really) Small Office Environments

Many thanks to Jeff Hengesbach for this blog entry!

There are a lot of very small businesses. I'm thinking about 20-30 or fewer people, and likely only 1 or maybe 2 servers. There are a few reasons this interests me. First, in the past I've done 'side' work in a few of these environments. And secondly, every time I enter a place of business of any sort, I'm always looking for technology and how it is being used. For the folks helping these smaller organizations out, I like to scale back some 'bigger' business concepts and show how they are advantageous for everyone.

The one thing that never ceases to amaze me, when I gain knowledge of it, is the age of the oh-so-critical systems these companies rely upon. My background and philosophy on physical (x86) server life cycles is 3-4 years and replace. I follow this cycle for multiple reasons: 1) OEM warranties are cost effective in this window, 2) Always run warrantied equipment, 3) Computational power leaps and storage costs plummet in 3-4 years, 4) It fits a good window for OS / application upgrades, and 5) Equipment is not that expensive in these environments. Of course for large / complicated systems these arguments don't hold as much water.

Over the past few years, I've seen 2 organizations lose servers from aged hard drives and other major component failures. They thankfully both had good backups, but were still out a bunch of time during the replace and rebuild process. If you think disk mirroring / system mirroring is a backup solution, please read this article on

The direction I'm heading with this is to ask small shop IT to consider the use of virtualization. On a small scale the solutions are virtually free; pick the one your expertise best fits. Consider the cost of a down system seriously - it will happen. A Virtual Machine image can be pulled from backup (choose your media wisely) and be up and running on a PC bought from the local big box store in very short order (depending on your VM solution). Get your replacement server, copy the VM over and the case is closed. No Windows hardware driver issues, Authoritative AD restores, configuration oversights, etc. Virtualization will also make for a near painless experience when it comes to keeping physical servers upgraded.

If your small office IT support isn't up to speed with Virtualization, ask them to get there or find new help. The benefits are too great and easy to reap to let them pass by.

Sunday, February 1, 2009

Dear Diary: Jackpot

Today, I set foot in Egypt. To say that it's been a dream of mine since I was a child would technically be accurate, but it wouldn't really convey the full meaning. When I was in elementary school, I was fortunate enough to attend a pilot school called TREK one day a week. This program was set up very similar to college. You picked your own classes, set your schedule, and learned about what interested you. I'm not even sure how many times I took Ancient Egyptology.

Part of the reason we took this cruise in particular was that we stay docked in Alexandria for two entire days. This gives us time to make the trip to Cairo. The amount of history in the slender Nile Valley is amazing. Since time began, people have made their homes here, and civilization flourished here in this oasis in the desert.

We're going on a private tour where we will visit the Museum of Egyptian Antiquities, several historic mosques, and of course, Giza plateau, home of the pyramids and the sphinx. Words cannot express how excited I am.

I'm almost home. Pictures will be forthcoming. Punch will be served. Or something like that.