Microsoft Cluster Server Center

Introduction and Basics


What is the Microsoft Cluster Server Center?
The purpose of this site is to share information regarding Microsoft Cluster Server, Windows Server 2000/2003 Cluster Services and Network Load Balancing.

Who provides the Microsoft Cluster Server Center?
The Microsoft Cluster Server Center is written and provided by Scott Schnoll and Rodney R. Fournier. Information contained in this site is derived from a multitude of sources. This site is UNOFFICIAL and in NO WAY sanctioned by or affiliated with Microsoft Corp. or any of its employees.

Many sources were used in the preparation of this site, such as books listed above. In addition, the white papers and other documents/pages listed here were useful in preparing this site. A very special thanks to Scott Schnoll for letting me copy tons of his information.

Clusters Defined
A cluster is a group of independent computers working together as a single system to ensure that mission-critical applications and resources are as highly-available as possible.  The group is managed as a single system, shares a common namespace, and is specifically designed to tolerate component failures, and to support the addition or removal of components in a way that's transparent to users.  Clustered systems have several advantages: fault-tolerance, high-availability, scalability, simplified management and support for rolling upgrades, to name a few.

There are two different types of cluster models in the industry: the shared device model and the shared nothing model.

In the shared device model, applications running within a cluster can access any hardware resource connected to any node in the cluster. As a result, access to the data must be synchronized. In many such implementations, a special component called a Distributed Lock Manager (DLM) is used for this purpose. A DLM is a service that manages access to cluster hardware resources. When multiple applications access the same resource, the DLM resolves any conflicts that might arise. Along with this sophistication and complexity, a DLM adds significant overhead to the cluster. Most of this overhead is additional traffic between nodes; however, performance also suffers because access to hardware resources is serialized.
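As a rough illustration of the conflict resolution described above, here is a minimal single-process sketch in Python. The class name `ToyLockManager` and its methods are hypothetical; a real DLM is a distributed service that exchanges messages between nodes, which this toy does not attempt to model.

```python
class ToyLockManager:
    """Hypothetical stand-in for a DLM: at most one holder per resource."""

    def __init__(self):
        self._holder = {}  # resource name -> node currently holding the lock

    def acquire(self, resource, node):
        """Grant the lock if it is free or already held by this node."""
        holder = self._holder.setdefault(resource, node)
        return holder == node

    def release(self, resource, node):
        """Release the lock; only the current holder may do so."""
        if self._holder.get(resource) != node:
            raise RuntimeError(f"{node} does not hold {resource}")
        del self._holder[resource]


dlm = ToyLockManager()
print(dlm.acquire("shared_disk", "node1"))  # node1 gets the lock
print(dlm.acquire("shared_disk", "node2"))  # node2 is refused while node1 holds it
dlm.release("shared_disk", "node1")
print(dlm.acquire("shared_disk", "node2"))  # node2 can now take the lock
```

Every access has to go through the lock manager, which is where the serialization cost mentioned above comes from.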

By default, Microsoft Cluster Server and the Windows Cluster Service use the shared nothing model.  Because this model does not use a DLM, it does not have the overhead incurred by using such a service.  In the shared nothing model, only one node can own and access a single hardware resource at any given time.  When failure occurs, a surviving node can take ownership of the failed node's resources and make them available to users.
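The ownership rule can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class `SharedNothingCluster`, its methods, and the node and resource names are all made up for illustration and are not part of any Microsoft API.

```python
class SharedNothingCluster:
    """Toy model: every resource has exactly one owning node at a time."""

    def __init__(self, nodes):
        self.online = set(nodes)
        self.owner = {}  # resource name -> owning node

    def assign(self, resource, node):
        if node not in self.online:
            raise ValueError(f"{node} is not online")
        self.owner[resource] = node

    def access(self, resource, node):
        """Only the current owner may access the resource."""
        return self.owner.get(resource) == node

    def fail(self, failed_node):
        """Mark a node as failed and hand its resources to a survivor."""
        self.online.discard(failed_node)
        if not self.online:
            raise RuntimeError("no surviving node to take ownership")
        survivor = sorted(self.online)[0]  # arbitrary choice for this sketch
        for resource, node in self.owner.items():
            if node == failed_node:
                self.owner[resource] = survivor
        return survivor


cluster = SharedNothingCluster(["node1", "node2"])
cluster.assign("disk_q", "node1")
print(cluster.access("disk_q", "node2"))  # node2 does not own the disk
cluster.fail("node1")
print(cluster.access("disk_q", "node2"))  # node2 took ownership after the failure
```

Because there is never more than one owner, no lock manager is needed to arbitrate access.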

Although both Microsoft Cluster Server and the Windows Cluster Service use the shared nothing model by default, they can also use the shared device model, but only if the clustered application supplies its own DLM.

Why Cluster?
Generally speaking, hardware failure is not the predominant cause of downtime.  The leading causes of downtime are typically related to events that are external to the system, such as misconfiguration, power outages, security breaches, and so forth.  Clustering cannot help you solve those types of problems.  In addition, a cluster cannot protect you from software incompatibilities, corrupt databases, viruses, catastrophes or mistakes.  Clustering is best implemented when a substantial proportion of your server downtime is caused by hardware failure.  If your organization’s leading cause of downtime is the result of failures in administration, software, or infrastructure, an investment in clustering technology may not reduce your downtime.

You first need to assess the reasons for server downtime in your organization, look at the problems that clustering solves, and then make a business decision as to whether clustering is an appropriate solution.  The primary focus of clustering is solving problems that arise from hardware failure, such as a blown CPU, bad memory, or the loss of an entire server. In addition, clustering allows you to continue providing resources during planned outages that may cause downtime for users. A cluster system can allow resources to be manually moved—or failed over—to one server while the other is brought down to perform a rolling upgrade, a configuration change, or other maintenance.

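A rolling upgrade is the process of applying a service pack or other hardware or software update to each node in the cluster, one node at a time, while the other node continues providing service. A rolling upgrade typically proceeds in stages: resources are moved off one node, that node is taken down and updated, and it is then brought back into the cluster.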

Then, repeat this process on each node in the cluster until the entire cluster is upgraded. Rolling upgrades are very attractive from a server management standpoint because services are only unavailable during the time it takes to move resources from one node to the other.

By design, clusters help increase uptime, which really means reduced downtime. Clustering can help reduce both planned and unplanned downtime. When any mission-critical system fails, the consequences can include lost revenue, interruption of services to customers, and knowledge workers sitting idle. In organizations of all sizes, failures incur costs in many areas. Hidden costs often include damage to your reputation among customers, suppliers, and end users, and the perception that your organization isn't able to satisfy customer needs.

Understanding the limitations of clustering is just as important as understanding the benefits. While clustering protects against the failure of a node in the cluster, it does not provide any protection against other problems, such as network failures, database corruption, loss of shared storage, or disasters.
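Returning to the rolling upgrade process described above, the node-by-node loop might be sketched as follows. This is an illustration only: the function `rolling_upgrade`, the `upgrade_node` callback, and the resource names are hypothetical, and the exact stages are an assumption based on the description in this section (move resources away, update the node, repeat).

```python
def rolling_upgrade(nodes, owner, upgrade_node):
    """Upgrade each node in turn, moving its resources away first.

    nodes        -- list of node names (at least two)
    owner        -- dict mapping resource name -> owning node (updated in place)
    upgrade_node -- callable applied to each node once it owns nothing
    """
    if len(nodes) < 2:
        raise ValueError("a rolling upgrade needs at least two nodes")
    steps = []
    for node in nodes:
        # Pick another node to carry the load while this one is updated.
        target = next(n for n in nodes if n != node)
        for resource in sorted(owner):
            if owner[resource] == node:
                owner[resource] = target
                steps.append(f"move {resource}: {node} -> {target}")
        upgrade_node(node)
        steps.append(f"upgrade {node}")
    return steps


owner = {"file_share": "node1", "sql": "node2"}
upgraded = []
for step in rolling_upgrade(["node1", "node2"], owner, upgraded.append):
    print(step)
```

In this sketch, users only see an interruption during the "move" steps, which matches the point that services are unavailable only while resources change hands.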

Before implementing a cluster in your environment, you should evaluate whether this solution really solves enough of your problems to justify its cost. Clustering adds complexity to your environment and administration. Therefore, it is important that you understand and evaluate this technology in relation to your overall goals and the needs of your network.

Fault Tolerance Defined
Fault tolerance is the ability of a system to continue functioning when part of it fails (i.e., experiences a fault). This term is used to describe disk subsystems (e.g., RAID), symmetric multiprocessing (SMP), redundant power supplies (with separate power sources), uninterruptible power supplies, redundant network adapters, etc. Fault tolerance is designed to alleviate the problems caused by component failures, power outages, or other similar occurrences.

Disk subsystems that use RAID, which stands for Redundant Array of Inexpensive Disks (or Redundant Array of Independent Disks, or Redundant Array of Inexpensive Devices, depending on who you ask), are considered fault tolerant. RAID refers to the grouping of individual hard disks in a way that provides continued operation in the event of a disk failure. There is both hardware RAID (i.e., a RAID controller is used) and software RAID (i.e., the functionality is provided by an operating system or application). There are many forms (levels) of RAID.

There are other implementations of RAID, such as RAID-0+1 and the closely related RAID-10, RAID-2, RAID-3, etc., but these are typically proprietary implementations unique to the hardware manufacturers that support them.
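To make the idea of "continued operation in the event of a disk failure" concrete, here is a small Python sketch of XOR parity, the technique behind parity-based RAID levels. The text above does not tie itself to a specific level, so treat this as a generic illustration; the helper `xor_blocks` and the sample data are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, each holding one 4-byte block.
data_disks = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_disks[1])
```

The rebuild works because XOR-ing a value in twice cancels it out, leaving only the missing block.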

High-Availability Defined
By definition, the goal of a highly available system is to provide continuous use of critical data and applications that keep businesses up and running, regardless of planned or unplanned interruption. High availability refers to system uptime that approaches 100%. For example, an availability level of 99.999%, calculated on a round-the-clock basis, would mean that an organization would experience no more than about five minutes of unscheduled downtime per year. A level of 99.99% translates to about 52 minutes of downtime. A level of 99.9% translates to about 8.7 hours, and a level of 99% equals about 3.7 days of downtime per year.
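The downtime figures above follow from simple arithmetic: multiply the number of minutes in a year by the fraction of time the system is allowed to be down. A minimal Python sketch, assuming a 365-day year (the function name is just for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Return the yearly downtime allowance, in minutes, for an availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in (99.999, 99.99, 99.9, 99.0):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.1f} min "
          f"({minutes / 60:.2f} hours, {minutes / 1440:.2f} days)")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the higher levels are so much harder to reach.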

The need for high availability is not limited to 24x7x365 environments. Many applications must be available during normal business hours or for critical time periods throughout the day. A system failure during these critical periods is unacceptable for many organizations.

Alternatives to Microsoft Cluster Server and Windows 2000 Cluster Services
See my links page for 3rd Party solutions.

Questions

If you have any questions, problems or comments related to this site or the information presented herein, please e-mail me. Please don't ask me for free technical support on clustering. If you're looking for free technical support, check out the Microsoft public newsgroups, which you can search at http://groups.google.com.

 

Copyright Information
Microsoft is a registered trademark of Microsoft Corp. Windows is a trademark of Microsoft Corp. Other information contained herein is credited where and when appropriate to the original source and copyright holder.  All other information contained in this site is Copyright © 2004 Rodney R. Fournier -- All Rights Reserved.  This site and the information contained herein may not be reproduced in part or in whole without the express written consent of the author, Rodney R. Fournier.


Page last modified: August 21, 2005
Contents copyright Net Working America, Inc., 1996-2005. All rights reserved.