Recovery time during an iSCSI outage

iSCSI Target for Microsoft Windows.

Moderators: anton (staff), Max (staff), Constantin (staff)

Post by lbroyles » Mon Dec 12, 2011 4:37 pm

I've been having a problem where one of the nodes experiences iSCSI errors on the sync channel, which causes the partner node to go offline and fall out of sync. I've been working on this with Max. Right now it only happens about once every 3 weeks (so it is getting better), but when it does happen it causes major problems, because HA does not recover the way I expected.

The issue is that when this happens, all of the VMs on my 3 ESXi servers stop. The console shows the VMs as running, but no user can connect to them: email is offline, SQL Server doesn't respond, file servers are offline, and so forth. After about 5-10 minutes (which seems like forever when everything halts in a production environment), most of the VMs recover and resume right where they left off. Exchange, SQL Server, and so forth just start going again as if nothing had happened, without a restart. A handful of the file servers need to be restarted because they experience a system fault.

So my question is: why do all the VMs have an issue when the partner node goes offline in this manner? Why doesn't the system simply recover and continue with little or no impact on the users? Is there some setting in ESXi that is waiting for a timeout to occur that I can fix?

Thanks,
Loren.
lbroyles
 
Posts: 1
Joined: Mon Dec 05, 2011 4:30 pm

Re: Recovery time during an iSCSI outage

Post by anton (staff) » Tue Dec 13, 2011 10:52 pm

I'll discuss your case with Max in a couple of hours (he's just back from a trip to Germany). Initially it looks like we need to find and fix two things: 1) the errors on the sync channel (if they are repeatable they should be pinpointed and killed; I'm pretty sure it's a broken switch or cable, or a NIC going crazy) and 2) why the primary storage stops responding when the partner goes AWOL. The second one sounds like either our issue or something still wrong with the network, and so somehow related to the first. It should not do what it does now...
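
To help pinpoint when the sync channel actually drops, something like the following could be left running on a machine that can reach the sync interface. This is only a rough sketch, not a StarWind tool: it opens a TCP connection to the standard iSCSI port (3260) once a second and records the spans during which the connection fails. The address `10.0.0.2` and the helper names `probe`, `summarize_outages`, and `monitor` are made up for illustration; substitute the partner node's real sync-channel IP.

```python
import socket
import time


def probe(host, port=3260, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize_outages(samples):
    """Collapse (timestamp, ok) samples into a list of (start, end) outages.

    An outage that is still in progress at the last sample has end=None.
    """
    outages = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # first failed probe: outage begins
        elif ok and start is not None:
            outages.append((start, ts))   # first good probe: outage ends
            start = None
    if start is not None:
        outages.append((start, None))
    return outages


def monitor(host, rounds, interval=1.0, port=3260):
    """Probe host `rounds` times, `interval` seconds apart; return outages."""
    samples = []
    for _ in range(rounds):
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return summarize_outages(samples)


if __name__ == "__main__":
    # Hypothetical sync-channel address; replace with the real one.
    for start, end in monitor("10.0.0.2", rounds=3600):
        began = time.ctime(start)
        ended = time.ctime(end) if end is not None else "still down"
        print(f"sync channel unreachable from {began} until {ended}")
```

The timestamps it prints can then be compared against the switch and NIC logs and against the time the VMs froze. Note that a successful TCP connect only shows the port is reachable; it says nothing about whether the iSCSI session itself is healthy.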

Regards,
Anton Kolomyeytsev

Chief Technology Officer, StarWind Software

anton (staff)
Site Admin
 
Posts: 2153
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

