Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering

Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 2

February 12, 2013

Welcome to Part 2 of Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering. In Part 1, I discussed what Majority Node Set Clustering is and how it works with Exchange Site Resilience when you have one DAG member in a Primary Site and one DAG member in a Failover Site. In this Part, I will show an example of how Majority Node Set Clustering works with Exchange Site Resilience when you have two DAG members in a Primary Site and one DAG member in a Failover Site.

Part 1

Part 2

Part 3

Real World Examples

In Part 1, I showed a Real World example when you have one Exchange DAG member in the Primary Site and one Exchange DAG member in the Failover Site. In this Part, I am showing a Real World example when you have two Exchange DAG members in the Primary Site and one Exchange DAG member in the Failover Site.

3 Node DAG (Two in Primary and One in Failover)

In the following screenshot, we have 3 Servers. Two are Exchange 2010 Multi-Role Servers; one in the Primary Site and one on the Failover Site. The Cluster Service is running on all three Exchange Multi-Role Servers. More specifically, it would run on the Exchange 2010 Servers that have the Mailbox Server Role. When Exchange 2010 utilizes an even number of Nodes, it utilizes Node Majority with File Share Witness. Because we have an odd number of Nodes, we are utilizing Node Majority and will not utilize a File Share Witness.

So now we have our three Servers; all three of them being Exchange. This means we have three voters and do not need a File Share Witness as we have a third node. So the question is, how many voters/servers/cluster objects can I lose? Well if you read the section on Majority Node Set (which you have to understand), you know the formula is (number of nodes /2) + 1. This means we have (3 Exchange Servers / 2) rounded down = 1 + 1 = 2. This means that 2 cluster objects must always be online for your Exchange Cluster to remain operational just like if we were utilizing 2 DAG members with a File Share Witness.

But now let’s say one of your Exchange Servers go offline. Well, you still have at least two cluster objects online. This means your cluster will be still be operational. If all users/services were utilizing the Primary Site, then everything continues to remain completely operational. If you were sending SMTP to the Failover Site or users were for some reason connecting to the Failover Site, they will need to be pointed to the Exchange Server in the Primary Site.

But what happens if you lose a second node? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times. At this time, the entire cluster goes offline. You need to go through steps provided in the site switchover process but in this case, you would be activating the Primary Site and specify a new Alternative File Share Witness Server that exists in the Primary Site so you can active the Exchange 2010 Server in the Primary Site. The DAG won’t actively use the File Share Witness but you should specify it anyways because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational. And once you re-add the second DAG node, you now have two DAG members in the DAG which will want to switch the DAG Cluster into a Node Majority with File Share Witness which is why you need to still specify a File Share Witness.

But what happens if you lose two nodes in the Primary Site? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times. At this time, the entire cluster goes offline. You need to go through steps provided in the site switchover process but in this case, you would be activating the Failover Site and specify a new Alternative File Share Witness Server that exists (or will exist) in the Failover Site so you can activate the Exchange 2010 Server in the Primary Site. The DAG won’t actively use the File Share Witness but you should specify it anyways because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational.

Once the Datacenter Switchover has occurred, you will be in a state that looks as such. An Alternate File Share Witness is not for redundancy for your 2010 FSW that was in your Primary Site. It’s used only during a Datacenter Switchover which is a manual process.

Once your Primary Site becomes operational, you will re-add the Primary DAG Server to the existing DAG which will still be using the 2010 Alternate FSW Server in the Failover Site and you will now be switched into a Node Majority with File Share Witness Cluster instead of just Node Majority. Remember I said with an odd number of DAG Servers, you will be in Majority Node Witness and with an even number, the Cluster will automatically switch itself to Node Majority with File Share Witness? You will now be in a state that looks as such.

Part of the Failback Process would be to switch to a FSW Server in the Primary Site. Once done, you will be back into your original operational state.

Now the final step of the Failback Process would be to re-add your final remaining DAG Member in the Primary Site. Once done, your cluster will switch back into a Node Majority Cluster and will no longer be utilizing the FSW.

As you can see with how this works, the question that may arise is where to put your the majority of your Exchange DAG Members? Well, it should be in the Primary Site with the most users or the site that has the most important users. With that in mind, I bet another question arises? Well, why with the most users or the most important users? Because some environments may want to use the above with an Active/Active Model instead of an Active/Passive. Some databases may be activated in both sites. But, with that, if the WAN link goes down, the Exchange 2010 Server in the Failover Site loses quorum since it can’t contact at least 1 other cluster object. Again, you must have two cluster objects online. This also means that each cluster object must be able to see one other cluster object. Because of that, the Exchange 2010 Server will go completely offline.

To survive this, you really must use 2 different DAGs. One DAG where the majority of your Exchange 2010 DAG Members is in the First Site and a second DAG where the majority of the Exchange 2010 DAG Members is in the Second Site. Users that live in the First Active Site would primarily be using the Exchange 2010 DAG Members in the First Active Site. Users that live in the Second Active Site would primarily be using the Exchange 2010 DAG Members in the Second Active Site. This way, if anything happens with the WAN link, users in the First Active Site would still be operational as the majority of its Exchange 2010 DAG Members for their DAG is in the First Active Site and DAG 1 would maintain Qourum. Users in the Second Active Site would still be operational as the majority of its Exchange 2010 DAG Members for their DAG is in the Second Active Site and DAG 2 would maintain Quorum.

Note: This would require twice the amount of servers since a DAG Member cannot be a part of more than one DAG. As shown below, each visual representation below of a 2010 HUB/CAS/MBX is a separate server.

The Multi-DAG Model would look like this.

Tagged with:

Categorised as: Exchange, Microsoft

Memorise