Monday, April 13, 2009

HOWTO: RedHat Cluster Suite

Alright, here it is, my writeup on RHCS. Before I continue, I need to remind you that, as I mentioned before, I had to pull the plug on it. I never got it working reliably enough that a single failure wouldn't bring down the entire cluster, and from the comments in that thread, I'm not alone.

This documentation is written for the RedHat Cluster Suite that shipped with RHEL/CentOS 5.2. It is important to keep this in mind, because if you are working with a newer version, there may be major changes. It has already happened once with the 4.x-to-5.x switch, which rendered most of the documentation on the internet outdated at best and destructive at worst. The single most helpful document I found was this: The Red Hat Cluster Suite NFS Cookbook, and even with that, you will notice the giant "Draft Copy" watermark. I haven't found anything to suggest that it was ever revised past "draft" form.

In my opinion, RedHat Cluster Suite is not ready for "prime time", and even in the words of a developer from #linux-cluster, "I don't know if I would use 5.2 stock release with production data". That being said, you might be interested in playing around with it, or you might choose to ignore my warnings and try it on production systems. If it's the latter, please do yourself a favor and have a backup plan. I know from experience that it's no fun to rip out a cluster configuration and try to set up discrete fileservers.

Alright, that's enough of a warning, I think. Let's go over what RHCS does.

RedHat Cluster Suite is designed to allow High Availability (HA) services, as opposed to a compute cluster which gives you the benefit of parallel processing. If you're rendering movies, you want a compute cluster. If you want to make sure that your fileserver is always available, you want an HA cluster.

The general idea of RHCS is that you have a number of servers (nodes), ideally at least three (two is possible, but not recommended). Each of those machines is configured identically and has the cluster configuration distributed to it. The cluster manager (cman) keeps track of which nodes are members of the cluster. The Cluster Configuration System (ccs) makes sure that all cluster nodes have the same configuration. The resource manager (rgmanager) makes sure that your configured resources are running where they're supposed to be, the Clustered Logical Volume Manager (clvmd) makes sure that everyone agrees which disks are available to the cluster, and the lock manager (dlm, the distributed lock manager, or the now-deprecated gulm, the grand unified lock manager) ensures that your filesystems' integrity is maintained across the cluster. Sounds simple, right? Right.

Alright, so let's make sure the suite is installed. The easiest way is to make sure the Clustering and Cluster Storage options are selected at install time or in system-config-packages. Note: if you have a standard RedHat Enterprise license, you'll need to pony up over a thousand dollars more per year, per node, to get the clustering options. The benefit of this is that you get support from RedHat, the value of which I have heard questioned by several people. Or you could just install CentOS, which is a RHEL clone. I can't recommend Fedora, simply because RedHat seems to test things out there, as opposed to RHEL (and CentOS), which only get "proven" software. Unless you're talking about perl, but I digress.
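
If the box is already installed, you can also pull the package groups in from the command line. These are the group names as of CentOS 5.2, so yours may differ:

# install the cluster and cluster-storage package groups
yum groupinstall "Clustering" "Cluster Storage"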

So the software is installed, terrific. Let's discuss your goals now. It is possible, though not very useful, to have a cluster configured without any resources. Typically you will want at least one shared IP address. In this case, the active node holds the IP, and whenever the active node changes, the IP moves with it. This is as good a time as any to mention that you won't be able to see this IP when you run 'ifconfig'; you've got to find it with 'ip addr list'.
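
For example (the interface and addresses here are made up), the service IP shows up as a second inet line on the interface:

# the cluster-managed IP won't appear in ifconfig, but ip(8) sees it
ip addr list dev eth0
#   inet 10.x.x.21/24 ... eth0    <- the node's own address
#   inet 10.x.x.50/24 ... eth0    <- the floating cluster IP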

Aside from a common IP address, you'll probably want to have a shared filesystem. Depending on what other services your cluster will be providing, it might be possible to get away with having them all mount a remote NFS share. You'll have to determine on your own whether your service will work reliably over NFS. Here's a hint: VMware Server won't, because of the way NFS handles file locking (at least, I hadn't gotten it to work as of a few months ago; YMMV).

Regardless, we'll assume you're not able to use NFS and you've got to have a shared disk. This is most commonly accomplished with a Storage Area Network (SAN). Setting up and configuring your SAN is beyond the scope of this entry, but the key point is that all of your cluster nodes have to have equal access to the storage resources. Once you've assigned that access in the storage configuration, make sure that each machine can see the volumes that it is supposed to have access to.
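
A quick sanity check (device names are just examples) is to compare the block devices each node can see:

# run on every node; each should show the same shared LUNs
cat /proc/partitions
fdisk -l 2>/dev/null | grep '^Disk /dev/'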

After you've verified that all the volumes can be accessed by all of the servers, filesystems must be created. I cannot recommend LVM highly enough. I wrote an introduction to LVM last year to help explain the concept and why you want to use it. Use this knowledge and the LVM Howto to create your logical volumes. Alternatively, system-config-lvm is a viable GUI alternative, although the interface takes some getting used to. When creating volume groups, make sure that the clustered flag is set to yes. This will stop them from showing up when the node isn't connected to the cluster, such as right after booting up.
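
A rough sketch of doing it from the command line (the device and volume names are hypothetical):

# mark the volume group as clustered at creation time (-c y)
pvcreate /dev/sdb1
vgcreate -c y vgData /dev/sdb1
lvcreate -n lvData -l 100%FREE vgData

# or flip the flag on a volume group that already exists
vgchange -c y vgData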

To make sure that the lock manager can deal with the filesystems, you must also edit the LVM configuration (typically /etc/lvm/lvm.conf) on all hosts, changing "locking_type = 1" to "locking_type = 3", which tells LVM to use clustered locking. Restart the LVM processes with 'service lvm2-monitor restart'.
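
Something like this, run on every node, does the trick (it keeps a backup copy of lvm.conf):

# switch LVM from local (1) to clustered/DLM (3) locking
sed -i.bak 's/locking_type = 1/locking_type = 3/' /etc/lvm/lvm.conf
service lvm2-monitor restart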

Now, let's talk about the actual configuration file. cluster.conf is an XML file that's separated by tags into sections. Each of these sections is housed under the "cluster" tag.

Here is the content of my file, as an example:


<cluster alias="alpha-fs" config_version="81" name="alpha-fs">
  <fence_daemon clean_start="1" post_fail_delay="30" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="fs1.int.dom" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device modulename="Server-2" name="blade-enclosure"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fs2.int.dom" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device modulename="Server-3" name="blade-enclosure"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fs3.int.dom" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device modulename="Server-6" name="blade-enclosure"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3" two_node="0"/>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="10.x.x.4" login="root" name="blade-enclosure" passwd="XXXXX"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="alpha-fail1">
        <failoverdomainnode name="fs1.int.dom" priority="1"/>
        <failoverdomainnode name="fs2.int.dom" priority="2"/>
        <failoverdomainnode name="fs3.int.dom" priority="3"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <clusterfs device="/dev/vgDeploy/lvDeploy" force_unmount="0" fsid="55712" fstype="gfs" mountpoint="/mnt/deploy" name="deployFS"/>
      <nfsclient name="app1" options="ro" target="10.x.x.26"/>
      <nfsclient name="app2" options="ro" target="10.x.x.27"/>
      <clusterfs device="/dev/vgOperations/lvOperations" force_unmount="0" fsid="5989" fstype="gfs" mountpoint="/mnt/operations" name="operationsFS" options=""/>
      <clusterfs device="/dev/vgWebsite/lvWebsite" force_unmount="0" fsid="62783" fstype="gfs" mountpoint="/mnt/website" name="websiteFS" options=""/>
      <clusterfs device="/dev/vgUsr2/lvUsr2" force_unmount="0" fsid="46230" fstype="gfs" mountpoint="/mnt/usr2" name="usr2FS" options=""/>
      <clusterfs device="/dev/vgData/lvData" force_unmount="0" fsid="52227" fstype="gfs" mountpoint="/mnt/data" name="dataFS" options=""/>
      <nfsclient name="ops1" options="rw" target="10.x.x.28"/>
      <nfsclient name="ops2" options="rw" target="10.x.x.29"/>
      <nfsclient name="ops3" options="rw" target="10.x.x.30"/>
      <nfsclient name="preview" options="rw" target="10.x.x.42"/>
      <nfsclient name="ftp1" options="rw" target="10.x.x.32"/>
      <nfsclient name="ftp2" options="rw" target="10.x.x.33"/>
      <nfsclient name="sys1" options="rw" target="10.x.x.31"/>
      <script name="sshd" file="/etc/init.d/sshd"/>
    </resources>
    <service autostart="1" domain="alpha-fail1" name="nfssvc">
      <ip address="10.x.x.50" monitor_link="1"/>
      <script ref="sshd"/>
      <smb name="Operations" workgroup="int.dom"/>
      <clusterfs ref="deployFS">
        <nfsexport name="deploy">
          <nfsclient ref="app1"/>
          <nfsclient ref="app2"/>
        </nfsexport>
      </clusterfs>
      <clusterfs ref="operationsFS">
        <nfsexport name="operations">
          <nfsclient ref="ops1"/>
          <nfsclient ref="ops2"/>
          <nfsclient ref="ops3"/>
        </nfsexport>
      </clusterfs>
      <clusterfs ref="websiteFS">
        <nfsexport name="website">
          <nfsclient ref="ops1"/>
          <nfsclient ref="ops2"/>
          <nfsclient ref="ops3"/>
          <nfsclient ref="preview"/>
        </nfsexport>
      </clusterfs>
      <clusterfs ref="usr2FS">
        <nfsexport name="usr2">
          <nfsclient ref="ops1"/>
          <nfsclient ref="ops2"/>
          <nfsclient ref="ops3"/>
        </nfsexport>
      </clusterfs>
      <clusterfs ref="dataFS">
        <nfsexport name="data">
          <nfsclient ref="ops1"/>
          <nfsclient ref="ops2"/>
          <nfsclient ref="ops3"/>
          <nfsclient ref="ftp1"/>
          <nfsclient ref="ftp2"/>
          <nfsclient ref="sys1"/>
        </nfsexport>
      </clusterfs>
    </service>
  </rm>
</cluster>


If you read carefully, most of the entries are self-explanatory, but we'll go over the broad strokes.

The first line names the cluster. It also has a "config_version" attribute, which is used to decide which cluster node has the most up-to-date configuration. In addition, if you edit the file and try to redistribute it without incrementing the value, you'll get an error, because the config_versions are the same but the contents are different. Always remember to increment the config_version.
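
For what it's worth, the usual way to push out a changed config on RHEL 5 (the version number here is just an example) is to bump config_version in the file and then:

# distribute the edited cluster.conf to the other nodes
ccs_tool update /etc/cluster/cluster.conf

# tell cman to pick up the new version number (e.g. 81 -> 82)
cman_tool version -r 82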

The next line is a single entry (you can tell from the trailing /) which defines the fence daemon. Fencing in a cluster is a means of cutting a machine off from cluster resources. The reason behind this is that if a node goes rogue, detaches itself from the other cluster members, and unilaterally decides that it is going to have read-write access to the data, then the data will end up corrupt: the actual cluster master will be writing to the data at the same time as the rogue node, and that is a Very Bad Thing(tm). To prevent this, all nodes are set up so that they are able to "fence" other nodes that disconnect from the group. The post_fail_delay in my config means "wait 30 seconds before killing a failed node". How fencing is actually carried out is covered later, in the fencedevices section.

The post_join_delay is misnamed and should really be called post_create_delay, since the only time it is used is when the cluster is first started (as in, there is no running node and the first machine is turned on). The default behavior of RHCS is to wait 6 seconds after starting and then "fence" any nodes listed in the configuration that haven't connected yet. I've increased this value to 30 seconds. The best solution is to never start the cluster automatically after booting; this lets you start up cluster services manually, which can prevent unnecessary fencing of machines.
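
If you take that route, one way to do it is simply to keep the init scripts from running at boot (service names as shipped with RHEL/CentOS 5.2):

# don't join the cluster automatically at boot; start these by hand instead
chkconfig cman off
chkconfig clvmd off
chkconfig rgmanager off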

Fencing is, by far, what gave me the most trouble.

The next section is clusternodes. This section defines each of the nodes that will be connecting to this cluster. The name is what you'll use to refer to the node with the command-line tools, the node ID is used in the logs and for internal referencing, and "votes" ties into an idea called "quorum". Quorum is the number of nodes necessary to operate a cluster, typically more than 50% of the total number of nodes; in a three-node cluster, it's 2. This is the reason two-node clusters are tricky: by dictating a quorum of 1, you are telling rogue cluster nodes that they should assume they are the active node. Not good. If you find yourself in the unenviable position of only having two possible nodes, you need to use a quorum disk.
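
I got away without one since I had three nodes, but the rough shape of a quorum disk setup looks something like this (the device, label, and timings are placeholders, not a tested config):

# create the quorum disk on a small shared LUN that both nodes can see
mkqdisk -c /dev/sdc1 -l alpha-qdisk

# reference it in cluster.conf alongside the cman line, e.g.:
#   <quorumd interval="1" tko="10" votes="1" label="alpha-qdisk"/>
# then start the quorum daemon on each node
service qdiskd start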

Inside each cluster node declaration, you need to specify a fence device. The fence device is the method used by fenced (the fencing daemon) to turn off the remote node. Explaining the various methods is beyond the scope of this document; read the fencing documentation for details, and hope not much has changed in the software since those docs were written.
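
Whichever agent you end up with, test it by hand before you trust it. fence_node fires off whatever fencing is configured for the named node, so aim it at a box you can afford to have power-cycled:

# fence fs2 using the device defined for it in cluster.conf (this WILL power it off or reboot it)
fence_node fs2.int.dom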

After clusternodes, the cman (cluster manager) line dictates the quorum (called "expected_votes") and two_node="0", which means "this isn't a two node cluster".

The next section is the fencedevices declaration. Since I was using Dell PowerEdge blades, I used the fence_drac agent, which has DRAC-specific programming to turn off nodes. Check the above-linked documentation for your solution.

<rm> stands for Resource Manager, and is where we will declare which resources exist, and where they will be assigned and deployed.

failoverdomains lists the various groups of cluster nodes. These should be created based on the services that the nodes will share. Since I was only clustering my three file servers, I only had one failover domain. If I wanted to cluster my web servers, I would have created a second failover domain (in addition to creating the nodes in the upper portion of the file as well). You'll see below in the services section where the failoverdomain comes into effect.
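
For example, a second domain for a couple of hypothetical web servers would just be another block inside failoverdomains (node names made up; the nodes would also need clusternode entries up top):

<failoverdomain name="alpha-web1">
  <failoverdomainnode name="web1.int.dom" priority="1"/>
  <failoverdomainnode name="web2.int.dom" priority="2"/>
</failoverdomain>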

In the resources list, you create "shortcuts" to things that you'll reference later. I'm doing NFS, so I've got to create resources for the filesystems I'll be exporting (the lines that start with clusterfs), and since I want my exports to be secure, I create a list of clients that will have access to the NFS exports (all others will be blocked). I also create a script resource that wraps sshd and lets me keep my host keys consistent across all three machines.

After the resources are declared, we begin the service specification. The IP address is set up, sshd is invoked, samba is started, and the various clusterfs entries are configured. All pretty straightforward here.

Now that we've gone through the configuration file, let's explain some of the underlying implementation. You'll notice that the configuration invokes the script /etc/init.d/sshd. As you probably know, that is the startup/shutdown script for sshd, which is normally started during init for the multiuser networked runlevels (3 and 5 on RH machines). Having the cluster start it would seem to imply that it wasn't running beforehand; however, that is not the case. I had replaced /etc/init.d/sshd with a cluster-aware version that points the various key files at the clustered filesystems when it runs as a cluster resource. Here are the changes:


# Begin cluster-ssh modifications
if [ -z "$OCF_RESKEY_service_name" ]; then
    #
    # Normal / system-wide ssh configuration
    #
    RSA1_KEY=/etc/ssh/ssh_host_key
    RSA_KEY=/etc/ssh/ssh_host_rsa_key
    DSA_KEY=/etc/ssh/ssh_host_dsa_key
    PID_FILE=/var/run/sshd.pid
else
    #
    # Per-service ssh configuration
    #
    RSA1_KEY=/etc/cluster/ssh/$OCF_RESKEY_service_name/ssh_host_key
    RSA_KEY=/etc/cluster/ssh/$OCF_RESKEY_service_name/ssh_host_rsa_key
    DSA_KEY=/etc/cluster/ssh/$OCF_RESKEY_service_name/ssh_host_dsa_key
    PID_FILE=/var/run/sshd-$OCF_RESKEY_service_name.pid
    CONFIG_FILE="/etc/cluster/ssh/$OCF_RESKEY_service_name/sshd_config"
    [ -n "$CONFIG_FILE" ] && OPTIONS="$OPTIONS -f $CONFIG_FILE"
    prog="$prog ($OCF_RESKEY_service_name)"
fi
[ -n "$PID_FILE" ] && OPTIONS="$OPTIONS -o PidFile=$PID_FILE"
# End cluster-ssh modifications

I got these changes from this wiki entry, and they seemed to work stably, even if the rest of the cluster didn't always.

You'll also notice that I specify all the things in the services section that normally exist in /etc/exports. That file isn't used in RHCS-clustered NFS. The equivalent of exports is generated on the fly by the cluster system. This implies that you should turn off the NFS daemon and let the cluster manager handle it.
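
If you want to see what the cluster has actually exported at any given moment, ask the kernel on the active node:

# list the current NFS exports that rgmanager has set up
exportfs -v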

When it comes to Samba, you're going to need to create configurations for the cluster manager to point to, since the configs aren't generated on the fly like they are for NFS. The naming scheme is /etc/samba/smb.conf.SHARENAME, so in the case of Operations above, I used /etc/samba/smb.conf.Operations. I believe that rgmanager (the resource group manager) automatically creates a template for you to edit, but be aware that it expects that particular naming scheme.
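
As a rough sketch only (this is just a minimal smb.conf; it is not necessarily what rgmanager generates, and the share details are assumptions), /etc/samba/smb.conf.Operations might look something like:

[global]
    workgroup = int.dom
    security = user

[operations]
    path = /mnt/operations
    writeable = yes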

Assuming you've created cluster-aware LVM volumes (you did read the howto I linked to earlier, right?), you'll undoubtedly want to create a filesystem. GFS (and its successor GFS2) is the most common filesystem for RHCS; a GFS2 filesystem is created with 'mkfs.gfs2'. But before you start making filesystems willy-nilly, you should know a few things.

First, GFS2 is a journaled filesystem: data to be written to disk goes to a scratch pad first (the scratch pad is called a journal) and is then copied from the scratch pad to its final location on disk. If the node crashes or loses access to the disk partway through a write, the filesystem can be brought back to a consistent state by replaying the journal.

Each node that will have write access to the GFS2 volume needs to have its own scratch pad. If you've got a 3 node cluster, that means you need three journals. If you've got 3 and you're going to be adding 2 more, just make 5 and save yourself a headache. The number of journals can be altered later (using gfs2_jadd), but just do it right the first time.
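
Putting that together, here's roughly what the mkfs invocation looks like for the deploy volume above; the lock table name is clustername:fsname, and the journal count matches the node count:

# three journals for a three-node cluster, DLM locking
mkfs.gfs2 -p lock_dlm -t alpha-fs:deploy -j 3 /dev/vgDeploy/lvDeploy

# add two more journals later if the cluster grows (the filesystem must be mounted)
gfs2_jadd -j 2 /mnt/deploy

(If you're on plain GFS, which is what the fstype in my config actually says, gfs_mkfs takes the same -p/-t/-j flags.)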

For more information on creating and managing gfs2, check the Redhat docs.

I should also throw in a note about lock managers here. Computer operating systems today are inherently multitasking. Whenever one program starts to write to a file, a lock is taken which (hopefully) prevents other programs from writing to the same file at the same time. To replicate that functionality across a cluster, you use a "lock manager". The old standard was GULM, the Grand Unified Lock Manager; it has been replaced by DLM, the Distributed Lock Manager. If you're reading documentation that openly suggests GULM, you're reading very old documentation and should probably look for something newer.


Once you've got your cluster configured, you probably want to start it. Here's the order I turned things on in:

# starts the cluster manager
service cman start

# starts the clustered LVM daemon
service clvmd start

# mounts the clustered filesystems (after clvmd has been started)
mount -a

# starts the resource manager, which turns on the various services, etc
service rgmanager start

I've found that running these in this order will sometimes work and sometimes hang. If it hangs, it's waiting to find other nodes. To remedy that, I try to start the cluster on all nodes at the same time. Also, if you don't, the post_join_delay will come back to bite you and the nodes that haven't joined yet will get fenced.
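
One crude way to kick everything off at (roughly) the same time is a loop over the nodes from a management box (node names are the ones from my config; adjust to taste):

# start cman on all nodes more or less simultaneously
for n in fs1 fs2 fs3; do
    ssh root@$n.int.dom "service cman start" &
done
wait
# then repeat for clvmd, the mounts, and rgmanager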

Have no illusions that this will work the first time. Or the second. As you can see, I made it to my 81st configuration version before I gave up, and I did a fair bit of research between versions. Make liberal use of your system logs, which will point to reasons that your various cluster daemons are failing, and try to divine the causes.

Assuming that your cluster is up and running, you can check on its status with clustat, move services with clusvcadm, and manually fence nodes with fence_manual. Expect to play a lot, and give yourself a lot of time to play and test. Test, test, test. Once your cluster is stable, try to break it. Unplug machines, network cables, and so on, watching the logs to see what happens, when, and why. Use all the documentation you can find, but keep in mind that it may be old.
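
For reference, the day-to-day commands look like this (the service and node names are the ones from my config):

# overall cluster and service status
clustat

# relocate the nfssvc service onto fs2
clusvcadm -r nfssvc -m fs2.int.dom

# disable / re-enable a service entirely
clusvcadm -d nfssvc
clusvcadm -e nfssvc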

The biggest source of enlightenment (especially to how screwed I was) came from the #linux-cluster channel on IRC. There are mailing lists, as well, and if you're really desperate, drop me a line and I'll try to find you help.

So that's it. A *long* time in the making, without a happy ending, but hopefully I can help someone else. Drop a comment below regaling me with stories of your great successes (or if RHCS drove you to drink, let me know that too!).

Thanks for reading!