
HA Network Failover Problem


HA Network Failover Problem

Post by jeffhamm » Tue Sep 20, 2011 9:56 pm

We have a two-node HA set running the latest build, and two clustered Hyper-V hosts attached to the storage using CSVs. I have MPIO set up on the Hyper-V hosts. When I stop or start the StarWind service on one of the nodes, the virtual machines keep running without any interruption in service.

Where we are having issues is when we simulate a complete network failure on one of the two StarWind nodes. If I disable all the network interfaces on one of the two nodes via a batch script (netsh disable...), the virtual machine freezes, and the LUNs completely disappear from both Hyper-V nodes. I then have to reboot all 4 boxes to get things running again. Obviously not a good situation.

I have tested and am able to replicate the above every time. Where would be a good place to start troubleshooting this issue?

Thanks,
Jeff

Re: HA Network Failover Problem

Post by anton (staff) » Tue Sep 20, 2011 10:03 pm

You're turning off all channels, which takes the heartbeat down as well. The node you leave running holds the "slave" token, so it takes itself down to avoid split brain. You can change this behaviour, but it's not recommended, as you'll run into split brain sooner or later.
Regards,
Anton Kolomyeytsev

Chief Technology Officer, StarWind Software


Re: HA Network Failover Problem

Post by jeffhamm » Wed Sep 21, 2011 5:48 pm

So by default, if the heartbeat goes down for any reason (NIC down, one node blue-screens, etc.), the whole SAN goes offline?

Re: HA Network Failover Problem

Post by anton (staff) » Thu Sep 22, 2011 9:00 am

No! By default, if all links between the HA nodes go down ("all" means every synchronization channel and every heartbeat channel as well), the node holding the "slave" token turns itself off to avoid split brain. With a properly configured cluster (heartbeat routed through the initiator-side subnetwork) you have zero chance of seeing both nodes down.

P.S. The only way to avoid this issue completely is to go to more than two HA nodes. Then we'll have a voting quorum. We'll present such a solution quite soon, so stay tuned :)


Re: HA Network Failover Problem

Post by jeffhamm » Thu Sep 22, 2011 1:36 pm

But would not all the heartbeat networks go down if the StarWind node had a Blue Screen of Death? If the node that had the BSOD is the one holding the "Primary" token for all LUNs, does not the entire SAN still go down at that point?

Re: HA Network Failover Problem

Post by rchisholm » Thu Sep 22, 2011 2:09 pm

Will the additional member of the voting quorum have to be a storage node? It would be great if it could just provide a quorum. In my situation, if it has to be a third storage node, it increases my costs greatly. With hundreds of TB, the cost of the drives, controllers, servers, rack space, power, and cooling makes a big difference for a third server.


Re: HA Network Failover Problem

Post by jeffhamm » Thu Sep 22, 2011 5:55 pm

Anton - I think I can live with the Split Brain scenario for now. How do you change the default setting to allow the Slave to stay online?

Thanks,
Jeff

Re: HA Network Failover Problem

Post by anton (staff) » Thu Sep 22, 2011 8:28 pm

I think you cannot, but whatever, it's your data. Please drop a message to support@starwindsoftware.com so the guys can help you. I don't want to publish "bad" advice in public :)


Re: HA Network Failover Problem

Post by jeffhamm » Thu Sep 22, 2011 8:51 pm

I totally get not wanting to hand out "bad advice", but let me explain my idea:

- Set the StarWind service to Manual instead of Auto on both nodes.
- If my primary node goes down hard, the slave continues to run and service requests from virtual machines.
- When my primary comes back online, the StarWind service does not start, so I can avoid data corruption issues.

Does this make sense or is it crazy?

Re: HA Network Failover Problem

Post by anton (staff) » Thu Sep 22, 2011 9:30 pm

No. But you've mixed the whole thing up. "No network connection between nodes" and "one node went down" are different things. We distinguish them and handle them differently. Your setup is the second case (a node going down), and you're talking about the first.

Re: HA Network Failover Problem

Post by jeffhamm » Thu Sep 22, 2011 9:49 pm

OK - I get it that split brain is bad and will stop talking about that :)

What I'm trying to simulate is a situation where one of the two nodes goes down hard. I thought I could do this by just disabling all the network connections on the primary node. I'm guessing what you are trying to tell me is that this is a bad test?

Would a better test be for me to just push the reset button on the primary node? And if I do reset the primary node, is the expected behavior that the slave node will continue to service requests from virtual machines, or that the slave node will stop servicing requests at that point to avoid split brain?

Sorry to be such a pain - we're close to going into production with our Hyper-V cluster, and we need to make sure we have all the failover scenarios accounted for and procedures for dealing with them in place if (or when) they occur.

Thanks!
Jeff

Re: HA Network Failover Problem

Post by anton (staff) » Thu Sep 22, 2011 9:58 pm

Yes, it's a bad test. Turning a node off and disabling all of its connections are two different things.

If a node is dead, the other one picks up its work and continues to process requests.

There's no master and slave node. There are only master and slave tokens, used to resolve split brain. In every other respect the nodes are equal.

P.S. This does not mean nothing is broken in your setup. So *do* experiment with turning nodes on and off and resyncing everything before putting the whole thing into production. That's wise indeed.
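
For anyone planning failover tests, the behaviour described in this thread can be written down as a small table of expected outcomes. The Python below is only a sketch built from the statements above, not StarWind code: the names `Failure`, `Token` and `expected_outcome` are made up for illustration, and the case where the master-token holder stays up after total link loss is an assumption the thread implies rather than states.

```python
from enum import Enum


class Failure(Enum):
    SERVICE_STOPPED = "StarWind service stopped on one node"
    NODE_DOWN = "one node powered off, reset, or crashed"
    ALL_LINKS_DOWN = "every sync and heartbeat channel between nodes is down"


class Token(Enum):
    MASTER = "master"
    SLAVE = "slave"


def expected_outcome(failure: Failure, token: Token) -> str:
    """Expected state of a node that is still running after `failure`.

    `token` is the split-brain token held by that running node.
    Hypothetical model derived only from the posts in this thread.
    """
    if failure is Failure.ALL_LINKS_DOWN and token is Token.SLAVE:
        # Both nodes are alive but cannot see each other, so the
        # slave-token holder takes itself offline to avoid split brain.
        return "offline"
    # Partner service stopped or partner node dead: the survivor keeps
    # processing requests. Assumption: after total link loss the
    # master-token holder also keeps serving.
    return "serving"


if __name__ == "__main__":
    for failure in Failure:
        for token in Token:
            print(f"{failure.name:16} {token.name:7} -> "
                  f"{expected_outcome(failure, token)}")
```

The only row that ends in "offline" is total link loss seen by the slave-token holder, which is the case the netsh test triggers. A reset-button test exercises the `NODE_DOWN` rows instead.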


Re: HA Network Failover Problem

Post by hixont » Fri Sep 23, 2011 12:19 am

rchisholm wrote:Will the additional member of the voting quorum have to be a storage node? It would be great if it could just provide a quorum. In my situation, if it has to be a third storage node, it increases my costs greatly. With hundreds of TB, the cost of the drives, controllers, servers, rack space, power, and cooling makes a big difference for a third server.

anton (staff) wrote:P.S. The only way to avoid this issue completely is to go to more than two HA nodes. Then we'll have a voting quorum. We'll present such a solution quite soon, so stay tuned :)

I would like to echo this concern. For SQL database servers that are mirrored (functionally what the SAN servers are doing), I only need a witness server, which doesn't have to be a fully kitted-out production SQL server. I can see reasons why I would want a fully functional three-legged HA SAN configuration (offsite replication/failover, for instance), but I can see equal validity in just having a witness server in place to act as a quorum voter. Would it be possible to have both options available? I haven't got the budget to absorb another fully provisioned SAN server.
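
To make the witness idea concrete, here is a minimal sketch of a generic strict-majority quorum check in Python. It is not the design StarWind announced, which this thread does not describe; the function name `has_quorum` and the three-voter layout (two storage nodes plus one diskless witness) are assumptions chosen for illustration.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """True if this node, counting itself, sees a strict majority of voters."""
    if not 1 <= reachable_voters <= total_voters:
        raise ValueError("reachable_voters must be between 1 and total_voters")
    return 2 * reachable_voters > total_voters


# Two storage nodes plus one diskless witness: three voters in total.
TOTAL = 3

# Storage node A loses its partner but still reaches the witness.
print(has_quorum(2, TOTAL))  # True

# Storage node B is cut off from both the partner and the witness.
print(has_quorum(1, TOTAL))  # False

# With only two voters and no witness, a lone node never has a majority.
print(has_quorum(1, 2))  # False
```

Under this rule the witness only has to vote, so it needs no drives or controllers of its own. The last case shows why a plain two-node setup cannot use majority voting and falls back on a tie-breaker such as the token.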

Re: HA Network Failover Problem

Post by anton (staff) » Fri Sep 23, 2011 6:56 am

1) We will support all of the listed scenarios, so you won't be forced into one particular way of working.

2) What hypervisor do you run at this moment?


Re: HA Network Failover Problem

Post by hixont » Fri Sep 23, 2011 4:32 pm

anton (staff) wrote:1) We will support all of the listed scenarios, so you won't be forced into one particular way of working.
Thanks.

anton (staff) wrote:2) What hypervisor do you run at this moment?
I am a Hyper-V (Windows 2008 R2) shop and bounce between Hyper-V Manager, Failover Cluster Manager and SCVMM 2008 R2 as my management consoles.
