How to avoid a full resync after a node reboot?

MichelZ
Posts: 34
Joined: Sun Mar 16, 2014 10:38 am

Thu Feb 05, 2015 12:56 pm

We had the following situation:

- Windows Update on a node
- We tried to stop the StarWind service; it did not stop for over an hour, so we decided to reboot the host
- After a reboot, a Full Synchronization started

How can we avoid this situation in the future? Does it really take more than an hour to stop the StarWind service?
We are using version 8.0.7509.
thefinkster
Posts: 46
Joined: Thu Sep 18, 2014 7:15 pm

Thu Feb 05, 2015 2:12 pm

I assume two nodes? How much space provisioned into how many images/targets?

I've found that with about 5.5 TB provisioned across roughly 20 images of various sizes (none over 1 TB), it takes the StarWind service on build 8.0.7145 about 30 minutes to shut down. This is a two-node system with no async targets. The two nodes run 4 x 10 GbE (40 Gb) for sync and a single GigE link for heartbeat, and each VMware host is connected to each StarWind node over 10 GbE, providing 20 Gb of total bandwidth for all VMs (2 VMware hosts, 2 StarWind nodes).

We will (StarWind will, really) need to know your exact setup. What controller, and what size/kind/number of drives are plugged in? RAID 1, 10, 5, 6, 50, 60, 0? What kind of CPUs, and how much RAM? How much L2 cache? I don't believe the size of L1 cache affects startup/shutdown speed (at least in my testing up to 4 GB of L1 cache), but L2 size certainly does, though in my testing only minimally (seconds of difference, as it should be with an SSD).

I think 7509 has a known latency bug that I believe would cause further, perhaps significant, delays when shutting down the StarWind service. StarWind can confirm this.

Also, with any clustered-node shutdown procedure, did you check whether the first node you went to shut off was the master node or the slave? Unfortunately, you have to check each image you have created. I always transfer the sync to the node that is remaining online to guarantee sync status, and this usually triggers a fast synchronization rather than a full one. If it does run a full, I can at least watch the status before turning off the service instead of wondering what is going on.
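
For what it's worth, when I patch a node I also wait for the service to actually report STOPPED before letting Windows reboot, rather than power-cycling while it is still flushing. A minimal sketch of that idea in Python, shelling out to sc.exe; the service name "StarWindService" and the two-hour timeout are assumptions, so confirm the real name with "sc query" on your box:

import subprocess
import sys
import time

SERVICE = "StarWindService"   # assumption: confirm the real service name with "sc query"
TIMEOUT_S = 2 * 60 * 60       # be generous; flushing caches can take a long time

def service_state(name: str) -> str:
    """Return the STATE value from 'sc query <name>' (e.g. RUNNING, STOPPED)."""
    out = subprocess.run(["sc", "query", name], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "STATE" in line:
            return line.split()[-1]
    return "UNKNOWN"

# Ask the service to stop, then poll instead of rebooting blindly.
subprocess.run(["sc", "stop", SERVICE], check=False)

deadline = time.monotonic() + TIMEOUT_S
while time.monotonic() < deadline:
    if service_state(SERVICE) == "STOPPED":
        print("Service stopped cleanly; safe to reboot.")
        sys.exit(0)
    time.sleep(30)

print("Service still not stopped; investigate before rebooting.", file=sys.stderr)
sys.exit(1)

If it never reaches STOPPED inside the window, that is the point to open a ticket rather than hard-reboot, since the hard stop is exactly what forces the full sync.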
MichelZ
Posts: 34
Joined: Sun Mar 16, 2014 10:38 am

Thu Feb 05, 2015 2:29 pm

Sure, here are my specs:

- 2 Images, 1 TB each
- 2 Nodes, Sync
- Currently, 4x 1 GBit
- Network usage was zero pretty quickly after shutdown
- I did not check which node was Master
- 12.5 GB L1 Cache on each device

Node 1 (the one I tried to shut down)
- 2x E5620 (2.4GHz, 8 Cores each)
- 32 GB RAM
- LSI 9271-8i with 512 MB RAM & BBU
- 24x Intel SSD 250 GB (INTELSSDSC2CW24)
- 2x Raid 10
- 4x 1 GBit (All trunked and VM & Sync traffic going over it)

Node 2
- 1x E5-2670 v3 (2.30 GHz, 12 Cores)
- 192 GB RAM
- LSI 9361-i8 with 1 GB RAM & BBU
- 24x Samsung 850 Pro 1 TB
- 1x Raid 60
- 4x 1 GBit (All trunked and VM & Sync traffic going over it)

Due to some other circumstances we cannot currently use 10 Gbit; this will change in the future.
I don't think it affects our current problem, as we don't see any network traffic when stopping the service.

Let me know if you need anything else :)

Thanks!
Michel
thefinkster
Posts: 46
Joined: Thu Sep 18, 2014 7:15 pm

Thu Feb 05, 2015 4:45 pm

Looking at the setup, the only culprit I can find is the 4 x 1 Gb carrying both VM traffic and StarWind Sync/Heartbeat. VM traffic should be separate from StarWind Sync.

Are these individual links, or did you team them together? Avoid teaming on the StarWind systems, and use multiple subnets/VLANs to guarantee network segregation. If you can, avoid using switches altogether, or make sure they are fully managed and able to use 9k jumbo frames.
StarwindDiagram.jpg
You can see in my diagram the individual subnets for every connection. You could probably team outside of StarWind on the virtual side and present one teamed NIC to the VMs, but I have not had time to test this scenario.
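
If you want to script the segregation instead of clicking through adapter settings, something like this is the idea; a rough sketch only, where the adapter names and 172.16.x.x subnets are made-up examples, and jumbo frames still have to be enabled per NIC driver:

import subprocess

# Hypothetical layout: one adapter per role, each on its own subnet (no teaming).
# Adapter names and addresses are examples only; adjust to your hardware.
PLAN = {
    "Sync1":     ("172.16.10.1", "255.255.255.0"),
    "Sync2":     ("172.16.11.1", "255.255.255.0"),
    "Heartbeat": ("172.16.20.1", "255.255.255.0"),
}

for adapter, (ip, mask) in PLAN.items():
    # Static address, no gateway: sync/heartbeat links should be point-to-point.
    subprocess.run(
        ["netsh", "interface", "ipv4", "set", "address",
         f"name={adapter}", "static", ip, mask],
        check=True,
    )

# Jumbo frames are a per-driver setting (e.g. "Jumbo Packet" in the NIC's
# advanced properties); there is no generic netsh switch for it, so set it
# by hand and make sure any switch ports are configured for 9k end to end.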

You'll also want to tune TCP for your specific NICs (TCP offload, Chimney, RSS queues, etc.).
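
Those knobs can be checked and scripted too; for example, a sketch assuming a Windows Server build of that era, and the values shown are examples only, so verify what your NIC vendor actually recommends before changing anything:

import subprocess

# Show the current global TCP settings first, so you know what you are changing.
subprocess.run(["netsh", "int", "tcp", "show", "global"], check=True)

# Example adjustments only; the "right" values depend on OS build and NIC driver.
tweaks = [
    ["netsh", "int", "tcp", "set", "global", "chimney=disabled"],  # TCP Chimney offload
    ["netsh", "int", "tcp", "set", "global", "rss=enabled"],       # Receive Side Scaling
]
for cmd in tweaks:
    subprocess.run(cmd, check=True)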

Basically break your system components into individual segments and validate that they are working as well as they can.

iperf will test NIC/bus speed on the systems, but not hard drive I/O. You'll want to use IOMeter to test I/O on the systems.
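
For the network half, the test can be as simple as an iperf pair between the nodes; here is a sketch of driving the client side, assuming iperf2 is on the PATH, the partner node is already running "iperf -s", and the address shown is just an example:

import subprocess

PARTNER = "172.16.10.2"   # example sync-link address of the other node

# Run a 30-second test with 4 parallel streams, roughly matching a 4-link setup.
result = subprocess.run(
    ["iperf", "-c", PARTNER, "-P", "4", "-t", "30"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)   # look at the [SUM] line; it should be near line rate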

So: test local system items (storage, NICs, etc.), then test cross-system NICs, then test locally attached iSCSI, then test remotely attached iSCSI. Usually in that order, since testing remotely is worthless if the local side does not work properly or at full potential.
MichelZ
Posts: 34
Joined: Sun Mar 16, 2014 10:38 am

Thu Feb 05, 2015 4:48 pm

These are all valid points, but there was zero network usage and zero disk usage after shutting down the server....
thefinkster
Posts: 46
Joined: Thu Sep 18, 2014 7:15 pm

Thu Feb 05, 2015 5:01 pm

MichelZ wrote:These are all valid points, but there was zero network usage and zero disk usage after shutting down the server....
Zero disk usage on the first node after you attempted to stop the StarWind service, and zero disk and network activity for the whole hour before you did the hard shutdown?

I'd assume this was related to the 7509 latency issues with L1 cache. You can disable the L1 cache, restart StarWind, and try stopping and starting the service to see whether the 7509 latency issue was the culprit.
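
A quick way to make that comparison objective is to time the stop/start with and without L1 cache; a minimal sketch, where the service name is again an assumption to be checked with "sc query":

import subprocess
import time

SERVICE = "StarWindService"   # assumption: confirm with "sc query"

def timed(*cmd) -> float:
    """Run a command and return how long it took, in seconds."""
    t0 = time.monotonic()
    subprocess.run(list(cmd), check=False)   # "net stop" blocks until the service stops
    return time.monotonic() - t0

# Repeat once with L1 cache enabled and once with it disabled, then compare.
print("stop took %.1f s" % timed("net", "stop", SERVICE))
print("start took %.1f s" % timed("net", "start", SERVICE))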

Also, when my StarWind is shutting down, there is definite disk activity under Resource Monitor as the StarWind service does its thing, so this surprises me if that is the case. You may want to do a Wireshark/packet capture and make sure there weren't any TCP issues causing lots of timeouts/retransmissions that would show up as basically zero network usage. At the very least, you'd see broadcasts hitting your NIC even if no other traffic was present. In other words, be absolutely sure about what is going on in the back end; otherwise we'll just be making assumptions. I assume there is nothing going on here, but it's better to rule it out so you don't have the question lingering.
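
If Resource Monitor is not conclusive, you can also log the raw counters while the service shuts down; a small sketch using psutil (assuming it is installed, e.g. via "pip install psutil"), sampling total disk and network bytes once a second:

import time
import psutil

prev_disk = psutil.disk_io_counters()
prev_net = psutil.net_io_counters()

# Start "sc stop StarWindService" in another window, then watch this output.
for _ in range(120):   # sample for two minutes
    time.sleep(1)
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    print("disk %10d B/s   net %10d B/s" % (
        (disk.read_bytes + disk.write_bytes) - (prev_disk.read_bytes + prev_disk.write_bytes),
        (net.bytes_sent + net.bytes_recv) - (prev_net.bytes_sent + prev_net.bytes_recv),
    ))
    prev_disk, prev_net = disk, net

If those numbers stay at effectively zero for the whole stop attempt, that rules out slow cache flushing and points back at the 7509 issue.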

You may want to open a ticket with support; they will run through your various scenarios and gather logs to help identify the issue and hopefully resolve it.

Also, 7145 seems to be the most stable build so far, so unless you need the LSFS fixes in the latest releases, you may want to look into getting that more stable build.
MichelZ
Posts: 34
Joined: Sun Mar 16, 2014 10:38 am

Fri Feb 06, 2015 5:35 am

StarWind support has advised us that this is a known issue and that the workaround is to disable the L1 cache.
A fixed version is due "in a few weeks".

Thanks
Michel
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Feb 11, 2015 12:19 pm

I can confirm that. Two weeks should be enough for us to improve this.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com