HA Clustering: KISS (and make up)
I like HA-clustering. I like to think that it is actually one of my specialties, and that I'm fairly good at it. 
When I tried to explain what a cluster is, I came up with a very simple explanation that gives an idea of what a cluster can be without all of the technical stuff. Just to give you this example:
Try to think of a car manufacturer that has sites in two locations. Both have the capability to build cars, but only one site is active at a time. Now you as a customer want to be able to communicate with this company no matter where they are currently working from. The way to do so would be a P.O. box. The active site just picks up the mail from this box and corresponds with you.
Say, for example, that one site burns down: the other would take over and correspond with you using this P.O. box, and to you as a customer the "failover" to the other site would not be noticeable.
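For those who prefer code to car factories, the P.O. box picture can be sketched in a few lines of Python. This is only an illustration of the analogy, not how any of the cluster products mentioned here work; the names `Site`, `PoBox`, `fail_over_if_needed` and `send_mail` are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One of the two factory sites (hypothetical stand-in for a cluster node)."""
    name: str
    healthy: bool = True


class PoBox:
    """The fixed address customers use; it hides which site is currently active."""

    def __init__(self, primary: Site, standby: Site) -> None:
        self.sites = [primary, standby]
        self.active = primary

    def fail_over_if_needed(self) -> None:
        # Keep the current site as long as it is healthy.
        if self.active.healthy:
            return
        # Otherwise hand the mail over to the first healthy site.
        for site in self.sites:
            if site.healthy:
                self.active = site
                return
        raise RuntimeError("no healthy site left to take over")

    def send_mail(self, letter: str) -> str:
        # The customer only ever talks to the box, never to a site directly.
        self.fail_over_if_needed()
        return f"{self.active.name} answered: {letter}"


site_a = Site("site A")
site_b = Site("site B")
box = PoBox(site_a, site_b)

first = box.send_mail("order status?")
site_a.healthy = False  # site A burns down
second = box.send_mail("order status?")
```

The customer calls `send_mail` the same way both times; only the site that answers changes.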
I know this doesn't cover all aspects, but it is a very effective way to describe the very basics of a cluster. Anybody can imagine a P.O. box and someone driving to pick up the mail from that box.
Now, at the company where I work we tend to use three main products for our clustering needs: the Microsoft Cluster Service for our Windows platforms, a custom-created product called PMC (very basic, two nodes with manual failover), and EMC Autostart. All provide basic failover functionality for shared resources, and usually some means to stop and start things like databases and applications.
All of the people here seem to answer one thing when you ask them about high availability. "Install a cluster" seems to be the common denominator. But when you ask them what they think of when it comes to high availability, you get all sorts of replies, ranging from "never down" or "100% reachable" to "guaranteed fast response times" or even the cloning of the runtime instance to other machines.
All are (in my opinion) quite valid responses, but there is one thing that I have learned over the past few years: The more complex the demands, the more stable your environment will be if you keep your design and implementation as simple as possible. Or in short "KISS".
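To put a number on answers like "never down" or "100% reachable", it helps to translate an uptime percentage into the downtime it still allows per year. The sketch below is plain arithmetic; the function name `yearly_downtime_hours` and the percentages are just examples, and a year is assumed to be 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def yearly_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a system may be down while still meeting the uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    hours = yearly_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours ({hours * 60:.1f} minutes) of downtime per year")
```

At 99.9% that is still roughly 8.76 hours a year, which is a long way from "never down".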
Requirements that are quite popular are, for example, "I want to monitor the response time of my database query", or "The SAPgui interpretation time should be under $X". Very much like in the uncertainty principle, we can say that as soon as we start to measure the response times of the database, we are also going to have an impact on these response times. And the more complex the demands are, the more you need to take into account and the higher the costs are going to be. Sun has a nice image displaying this, and it is a general image you will see when you are searching for HA-clustering.
My advice? Try to keep it down to a minimum.
Rely on your hardware redundancy. You can use the N+1 principle there and usually save quite a bit. Also, make sure that the people who are working on the cluster know what they are doing. I've seen most errors here start off with poorly defined monitors, too many monitors, or user error (or PEBKAC).
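A rough feel for what N+1 buys you can be had with a bit of probability. The sketch below assumes that every unit (a fan, a PSU, a node) fails independently and has the same availability, which real hardware only approximates; the function names and the 99% figure are purely illustrative.

```python
def availability_without_spare(n_required: int, unit_availability: float) -> float:
    """All n_required units must be up at the same time."""
    return unit_availability ** n_required


def availability_n_plus_1(n_required: int, unit_availability: float) -> float:
    """At least n_required out of n_required + 1 independent units are up."""
    total = n_required + 1
    a = unit_availability
    all_up = a ** total
    exactly_one_down = total * a ** (total - 1) * (1 - a)
    return all_up + exactly_one_down


# Illustrative figure: each unit is available 99% of the time.
plain = availability_without_spare(2, 0.99)
with_spare = availability_n_plus_1(2, 0.99)
print(f"2 units, no spare: {plain:.6f}")
print(f"2 units plus 1 spare: {with_spare:.6f}")
```

One extra unit covers the failure of any single unit, which is why N+1 is usually far cheaper than doubling everything.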
In short, a cluster is always complex and tailored toward the application you are trying to make highly available. Keep the design as simple as you can and gather people around you with knowledge of the application so you can define a good set of working guidelines and monitors. All in all, a case of "KISS".
Comments
Nice and simple explanation about a cluster.
Maybe nice to explain the KISS acronym: Keep It Super Simple, or sometimes Keep It Stupid Simple. This can apply to more than just a clustering solution.
The KISS solution should always be closely guarded by the following:
Never underestimate the ingenuity of a total fool.
Or in other words: design something in a foolproof way, and some fool will find or make a new way. Of course, using the KISS principle makes it less likely to screw up and have smart fools around.
@Swartzkip: If you check the mouseover on the first mention of KISS, you will see the explanation of the acronym.
Hardware is almost never redundant by design (99.9% of the commercially available hardware comes in single units only), and thus you need to rely on the supplied/added software and/or hardware to make any cluster achieve an uptime of xx.xxx%. And since software has specific requirements on the types of hardware you need to buy, it can't always stay simple & cheap (even if you would only need 2 physical units to comply with the minimum requirements).
@MAX3400 True enough, but these days you will find that on the systems you are considering using for HA-clustering, you will need to invest in some sort of hardware fallback. You will usually have hot-swap fans, redundant PSUs, RAID controllers for the local disks and stuff like that. If you are talking about big Unix boxes you can even hot-swap the cell boards, and usually you will pay less than for a fully implemented HA-clustering solution.
Basically you want to be able to increase the redundancy of the hardware where possible. This will reduce the risk of having to fail over to another node. A failover is almost always linked to downtime because you need to stop any running applications in order to switch things like caches or transactions. You could always work with systems like replicated enqueue engines and such, but those often require a three-tier landscape.
All in all, I would start looking in the lower end first to improve application availability and work my way up to other HA solutions.