System comes to a grinding halt on Sync

MichelZ · Thu Jan 22, 2015 4:32 pm

Hi

While re-syncing an HA cluster (we had some trouble with one node), all client-facing connections seem to come to a halt. The VMs show 100% disk usage and a disk queue >10.
Performance shows ~1000 IOPS / 27 MB/s.

The NICs are currently 2x 1 Gbit; however, network throughput seems to top out at about 700 Mbps.
Moving the priority slider toward client requests does not help (bandwidth drops to 300-500 Mbps, with no change in VM performance; still slow).

Storage:
24x SSD with LSI controller (both arrays) and BBU
Windows 2012 R2

StarWind Version 8.0.7509
Normal HA Device (not LSFS)

VMware ESX 5.5

The disks on the arrays are idle in Disk Monitor (3-4% usage, disk queue <0.1).

This does not make a lot of sense to me.
I do understand that a dedicated 10 Gbit replication channel is recommended, and we're working on this. I would still expect, however, that at least one 1 Gbit link could be saturated and/or that lowering the sync priority would give near-normal client VM performance...
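To put the numbers above on a common scale, here is a rough back-of-the-envelope sketch. The function names are made up for illustration, and it assumes decimal units (8 bits per byte, 1 MB = 1000 KB) and ignores iSCSI/TCP overhead, so treat the results as approximate.

```python
# Back-of-the-envelope conversions for the figures quoted in this post.
# All helper names are hypothetical; units are decimal and protocol
# overhead is ignored.

def mbps_to_mbytes_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second."""
    return mbps / 8.0

def link_utilization(observed_mbps: float, link_mbps: float, links: int = 1) -> float:
    """Fraction of the available link capacity that is actually in use."""
    return observed_mbps / (link_mbps * links)

def avg_io_size_kb(throughput_mb_s: float, iops: float) -> float:
    """Average I/O size in KB implied by a throughput and an IOPS figure."""
    return throughput_mb_s * 1000.0 / iops

if __name__ == "__main__":
    observed_mbps = 700      # observed network throughput
    client_mb_s = 27         # reported client throughput
    client_iops = 1000       # reported client IOPS

    print("Network throughput in MB/s:", mbps_to_mbytes_per_s(observed_mbps))
    print("Utilization of one 1 Gbit link:", link_utilization(observed_mbps, 1000))
    print("Utilization of both 1 Gbit links:", link_utilization(observed_mbps, 1000, links=2))
    print("Average client I/O size in KB:", avg_io_size_kb(client_mb_s, client_iops))
```

Under these assumptions, neither a single link nor the pair is fully used, which matches the observation that nothing looks saturated.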

Any ideas?

thefinkster · Thu Jan 22, 2015 6:42 pm

I find this happens in only one scenario: you had to tell one of the nodes that wasn't considered the master that its image IS the one to use as the source for synchronization. Can you elaborate on what specifically happened and the steps you took to rectify (in your mind) the synchronization? The chronology of events (when and what you clicked) will be somewhat important: did the sync happen automatically, or did you have to force it, etc.

So far in my experience, both production and lab, this happens in nearly every hard-down situation I have EVER encountered. It does not happen 100% of the time, and I have yet to find a reliable way to reproduce the cases where it works (like 1% of the time) versus the cases where it fails (99% of the time).

Our internal policies are built such that we will not attempt a "force sync" (which invariably results in a full and not a fast sync) until after hours, when production is no longer impacted. The risk of killing iSCSI access to the system, or bringing it to a damned crawl, is just too great. In VMware, you don't actually see any sort of disconnection, but with Microsoft's iSCSI Initiator the devices will be listed as "reconnecting..." and there's no way to force them to connect until the synchronization is finished.

My two production nodes have 40 Gbit (4x 10 Gbit) of synchronization bandwidth, with a dedicated 10 Gbit link to each host, direct-connected with no switch. (Each StarWind system has 3x 2-port 10GbE Broadcom NICs.) Even during a full synchronization I should never lose access to the "online" node, especially since the sync channel is 4x the size of client access. I've just been too busy to bring this scenario back up for testing so we can see whether it can be fixed, and to give StarWind test parameters that reproduce it reliably.

Also, StarWind will definitely work with you on the issue (remote in to the system and help with testing/logging/etc.), assuming you have support. They may even help without support, but just be patient with them.

MichelZ · Thu Jan 22, 2015 6:55 pm

Yes, I had a situation where I accidentally took down the wrong node and needed to force a full re-sync with manual master selection.
It still strikes me as very odd that nothing seems to be overloaded (network, disk, CPU, RAM), but the connections to the hosts are still sluggish/slow.

If that is the normal behavior during full sync (and full sync is not expected to happen often), then I'm absolutely fine with that...

thefinkster · Thu Jan 22, 2015 7:42 pm

Yea. I caution anyone with StarWind to never, ever, plan on taking both nodes down at the same time.. EVER. So does StarWind.

MichelZ · Thu Jan 22, 2015 7:43 pm

I know, it was an accident...

Mon Jan 26, 2015 8:19 pm

Yea. I caution anyone with StarWind to never, ever, plan on taking both nodes down at the same time.. EVER. So does StarWind.

Yup, we do