Hyper-Converged Infrastructure Production Record: 6.7 million IOPS. 51% of the theoretical limit

A 12-node StarWind HyperConverged Appliance (HCA) cluster delivers 6.7 million IOPS. It is powered by Intel® Xeon® Platinum 8268 processors, Intel® Optane™ SSD DC P4800X Series drives, and Mellanox ConnectX-5 100GbE NICs in Supermicro SuperServers, all connected to Mellanox SN2700 Spectrum™ switches with Mellanox LinkX® copper cables.

Introduction

We offer the StarWind HyperConverged Appliance (HCA) to SMBs and ROBOs that want to curtail application downtime but have limited IT staff and/or budgets. StarWind HCA is a 100% software-defined hyperconverged platform built on either a Dell® OEM or a StarWind-branded server platform; here we use Supermicro SuperServers. This article describes the first of three benchmark stages and shows the performance of a 12-node production-ready StarWind HCA cluster.

Hardware

Each node was powered by Intel® Xeon® Platinum 8268 processors, 2x Intel® Optane™ SSD DC P4800X Series drives, and Mellanox ConnectX-5 100GbE NICs connected with Mellanox LinkX® copper cables to 2 Mellanox SN2700 Spectrum™ switches. In other words, we used standard StarWind HyperConverged Appliances (Supermicro SuperServer chassis); only the CPUs were upgraded in pursuit of the HCI industry record.

Software

In our setup, StarWind HCA ran the fastest software available: Microsoft Hyper-V (Windows Server 2019) and the StarWind Virtual SAN service, an application that runs in Windows user-land. StarWind supports polling, which avoids interrupt-driven IO, to reduce latency and boost performance by turning CPU cycles into IOPS. We also developed the TCP Loopback Accelerator, which bypasses the TCP stack for traffic that stays on the same machine, and an iSCSI load balancer, which assigns each new iSCSI session to a new CPU core and thereby works around issues in the aging Microsoft iSCSI Initiator.

On our website, you can learn more about hyperconverged infrastructures powered by StarWind Virtual SAN.

12-node StarWind HyperConverged Appliance cluster specifications:

Platform: Supermicro SuperServer 2029UZ-TR4+
CPU: 2x Intel® Xeon® Platinum 8268 Processor 2.90 GHz. Intel® Turbo Boost ON, Intel® Hyper-Threading ON
RAM: 96GB
Boot Storage: 2x Intel® SSD D3-S4510 Series (240GB, M.2 80mm SATA 6Gb/s, 3D2, TLC)
Storage Capacity: 2x Intel® Optane™ SSD DC P4800X Series (375GB, 1/2 Height PCIe x4, 3D XPoint™). The latest available firmware installed.
RAW capacity: 9TB
Usable capacity: 8.38TB
Working set capacity: 4.08TB
Networking: 2x Mellanox ConnectX-5 MCX516A-CCAT 100GbE Dual-Port NIC
Switch: 2x Mellanox SN2700 Spectrum 32-port 100GbE Ethernet Switch

The diagram below illustrates the servers’ interconnection.

Interconnection diagram

NOTE: On every server, each NUMA node has 1x Intel® SSD D3-S4510, 1x Intel® Optane™ SSD DC P4800X Series and 1x Mellanox ConnectX-5 100GbE Dual-Port NIC. This configuration let us squeeze maximum performance out of each piece of hardware. It is a recommendation rather than a strict requirement. Obtaining similar performance requires no NUMA tuning: the default settings were sufficient.

Software Setup

Operating system. Windows Server 2019 Datacenter Evaluation, version 1809, build 17763.404, was installed on all nodes with the latest updates available on May 1, 2019. For performance, the power plan was set to High Performance; all other settings, including the relevant side-channel mitigations (mitigations for Spectre v1 and Meltdown were applied), were left at their defaults.

Windows Installation. Hyper-V Role, MPIO, and Failover Cluster features
To speed up deployment, we made an image of Windows Server 2019 with the Hyper-V role installed and the MPIO and Failover Cluster features enabled. The image was then deployed on the 12 Supermicro servers.

Driver installation. Firmware update
Once Windows was installed, Windows Update was applied for each piece of hardware, and firmware updates were installed for the Intel NVMe SSDs.

StarWind Virtual SAN. Current production-ready StarWind VSAN version (8.0.0.12996) was installed on each server. The entire cluster had 8.38TB usable capacity from 9TB RAW capacity. Microsoft recommends creating at least one Cluster Shared Volume per server node. Therefore, for 12 servers, we created 12 volumes with ReFS.

ReFS gave us better performance than we had seen with the NTFS file system. Each volume had 340GB capacity and used two-way mirror resiliency, with allocation limited to two servers. For this particular working set, we ended up with 4.08TB of total usable storage. All other settings, like columns and interleave, were left at their defaults. To accurately measure IOPS to persistent storage only, the in-memory CSV read cache was disabled.
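The capacity figures quoted in this article can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 1000 GB), and the variable names are ours, not part of any StarWind tooling.

```python
# Capacity figures taken from the article; decimal units assumed (1 TB = 1000 GB).
NODES = 12
OPTANE_PER_NODE = 2
OPTANE_GB = 375      # capacity of one Intel Optane SSD DC P4800X
VOLUMES = 12         # one Cluster Shared Volume per node
VOLUME_GB = 340      # capacity of each volume

raw_tb = NODES * OPTANE_PER_NODE * OPTANE_GB / 1000
working_set_tb = VOLUMES * VOLUME_GB / 1000

print(f"RAW capacity: {raw_tb:g} TB")          # RAW capacity: 9 TB
print(f"Working set:  {working_set_tb:g} TB")  # Working set:  4.08 TB
```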

StarWind iSCSI Accelerator (Load Balancer). We used the built-in Microsoft iSCSI Initiator together with our own user-mode iSCSI initiator. The Microsoft iSCSI Initiator was developed in the “stone age”, when servers had one or two CPU sockets with a single core per socket. On today's far more powerful servers, the Initiator does not work as it should.

CPU Core load diagram

So, we developed iSCSI Accelerator as a filter driver between Microsoft iSCSI Initiator and the network stack. Every time a new iSCSI session is created, it is assigned to a free CPU core. As a result, all CPU cores are used uniformly and latency drops. Distributing workloads this way ensures smart use of compute resources: no cores are overwhelmed while others sit idle.
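The idea of spreading sessions across cores can be sketched in a few lines. This is not StarWind's filter driver, only a minimal illustration under our own assumptions: "free core" is interpreted as "the core currently serving the fewest sessions", and the function name and the 4-core host are hypothetical.

```python
from collections import Counter

def assign_session(core_load, cores):
    """Assign a new session to the least-loaded core (lowest index wins ties)."""
    core = min(cores, key=lambda c: (core_load[c], c))
    core_load[core] += 1
    return core

cores = range(4)    # hypothetical 4-core host
load = Counter()    # sessions currently assigned to each core
placement = [assign_session(load, cores) for _ in range(6)]
print(placement)    # [0, 1, 2, 3, 0, 1]
```

With this policy, no core receives a second session until every core has one, which matches the goal of keeping cores evenly loaded.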

Accelerated CPU Core load diagram

StarWind iSCSI Accelerator (Load Balancer) was installed on each cluster node in order to balance virtualized workloads between all CPU cores in Hyper-V servers.

StarWind Loopback Accelerator. As a part of StarWind Virtual SAN, StarWind Loopback Accelerator was installed and configured to significantly decrease latency and CPU load in cases where Microsoft iSCSI Initiator connects to StarWind iSCSI Target over the loopback interface. It provides a zero-copy memory path in loopback mode, so most of the TCP stack is bypassed.

NOTE: Thanks to the fast path provided by StarWind Loopback Accelerator, each iSCSI LUN had 2 loopback iSCSI sessions and 3 external partner iSCSI sessions. The Least Queue Depth (LQD) MPIO policy was set. This policy maximizes bandwidth utilization by automatically using the active/optimized path with the fewest outstanding requests.
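The Least Queue Depth rule itself is simple to express. The sketch below illustrates the selection logic only; it is not the Windows MPIO implementation. The `IscsiPath` class, the path names, and the queue depths are hypothetical, and treating all five sessions as active/optimized is our assumption.

```python
from dataclasses import dataclass

@dataclass
class IscsiPath:
    name: str
    active_optimized: bool
    outstanding: int  # IOs currently queued on this path

def pick_path(paths):
    """Return the active/optimized path with the fewest outstanding IOs."""
    candidates = [p for p in paths if p.active_optimized]
    if not candidates:
        raise RuntimeError("no active/optimized path available")
    return min(candidates, key=lambda p: p.outstanding)

# Hypothetical snapshot of one LUN: 2 loopback + 3 partner sessions.
paths = [
    IscsiPath("loopback-1", True, 4),
    IscsiPath("loopback-2", True, 1),
    IscsiPath("partner-1", True, 6),
    IscsiPath("partner-2", True, 3),
    IscsiPath("partner-3", True, 5),
]

chosen = pick_path(paths)
chosen.outstanding += 1   # the new IO is now queued on that path
print(chosen.name)        # loopback-2
```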

iSCSI sessions interconnection diagram

Block iSCSI/iSER (RDMA). Like StarWind HyperConverged Appliances, the 12-node HCI cluster features Mellanox NICs and switches. In this study, StarWind Virtual SAN used iSER (RDMA) for the backbone links, delivering the maximum possible performance.

NOTE: Windows Server 2019 doesn’t have iSER (RDMA) support just yet. The lack of end-to-end RDMA puts pressure on the CPU and limits performance. To bypass the overhead of the local Windows TCP/IP stack, StarWind’s built-in user-land iSER initiator is used for data and metadata synchronization and for acknowledging “guest” writes over iSER (RDMA). The latency of in-memory data copy operations is higher than we expected. Hence, the performance results for 4K IO blocks are lower than what the storage itself can deliver.

Accelerating IO Performance: bandwidth and IOPS

As a result, the IO performance was accelerated with a combination of RDMA, DMA in loopback, and TCP connections.

NUMA node. To take the NUMA configuration of each cluster node into account, the virtual disk replicated between two servers was configured to use a network adapter on the same NUMA node as the disk itself. For example, on cluster node 3, the virtual disk was created on the Intel Optane SSD located on NUMA node 1, so disk mirroring used the Mellanox ConnectX-5 100GbE NIC that is also located on NUMA node 1.
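The pairing rule described here (replicate a disk over a NIC on the same NUMA node) amounts to a simple lookup. The sketch below is illustrative only: the device names and the inventory dictionaries are hypothetical, not output from any real tool.

```python
# Hypothetical device inventory for one cluster node: device -> NUMA node.
DISK_NUMA = {"optane-0": 0, "optane-1": 1}
NIC_NUMA = {"connectx5-0": 0, "connectx5-1": 1}

def nic_for_disk(disk):
    """Return a NIC located on the same NUMA node as the given disk."""
    node = DISK_NUMA[disk]
    for nic, nic_node in NIC_NUMA.items():
        if nic_node == node:
            return nic
    raise LookupError(f"no NIC on NUMA node {node}")

print(nic_for_disk("optane-1"))  # connectx5-1
```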

NUMA node assignment diagram

CSV. For the 12-node hyperconverged cluster, 12 Cluster Shared Volumes were created on top of 12 synchronously mirrored StarWind virtual disks, according to Microsoft recommendations.

Hyper-V VMs. Empirically, we found that 12 virtual machines per server node with 2 virtual processors each (24 virtual processors per node) were enough to saturate performance. That’s 144 total Hyper-V Gen 2 VMs across the 12 server nodes. Each VM ran Windows Server 2019 Standard and was assigned 2 GiB of memory.

NOTE: NUMA spanning was disabled so that each virtual machine's virtual processors and memory stay within a single NUMA node, which avoids the cost of remote memory access.

Benchmarking

In virtualization and hyperconverged infrastructures, it’s common to judge performance by the number of input/output (I/O) operations per second, or “IOPS”: essentially, the number of reads or writes that virtual machines can perform. A single VM can generate a huge number of either random or sequential reads/writes. In real production environments, there are usually many VMs, which makes the data flow fully randomized. 4 kB block-aligned IO is the block size that Hyper-V virtual machines use, so it was our IO pattern of choice.

In the industry, hardware and software vendors often use this very type of pattern. In other words, they basically measure the best possible performance under the worst circumstances.

In this set of articles, we not only performed the same tests as Microsoft but also benchmarked performance under other IO patterns that are common for production environments.

VM Fleet. We used the open-source VM Fleet tool available on GitHub. VM Fleet makes it easy to orchestrate DISKSPD, the popular Windows micro-benchmark tool, in hundreds or thousands of Hyper-V virtual machines at once.

Let’s take a closer look at the DISKSPD settings. To saturate the environment, we set 2 threads per file (-t2); following Intel recommendations, this number of threads was used across the storage IO tests. Storage performance peaked at the saturation point of 32 outstanding IOs per thread (-o32). To disable hardware and software caching, we specified unbuffered IO (-Sh). We specified -r for random workloads and -b4K for 4 kB block size. We varied the read/write proportion with the -w parameter.

In summary, here’s how DISKSPD was invoked (-L collects latency statistics, and -d900 runs each test for 900 seconds): .\diskspd.exe -b4K -t2 -o32 -w0 -Sh -L -r -d900 [...]
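To make the effect of these settings concrete, the sketch below assembles the DISKSPD argument list from the parameters named above and counts how many IOs are kept in flight. It is a hedged illustration: the helper `diskspd_args` is hypothetical, and the cluster-wide figure assumes that each of the 144 VMs runs one DISKSPD instance against one file.

```python
def diskspd_args(block_kib, threads, outstanding, write_pct, duration_s, random=True):
    """Build a DISKSPD argument list from the settings described in the text."""
    args = [
        f"-b{block_kib}K",   # block size in KiB
        f"-t{threads}",      # threads per file
        f"-o{outstanding}",  # outstanding IOs per thread
        f"-w{write_pct}",    # percentage of writes
        "-Sh",               # disable software and hardware caching
        "-L",                # collect latency statistics
    ]
    if random:
        args.append("-r")    # random IO (omitted for sequential runs)
    args.append(f"-d{duration_s}")  # test duration in seconds
    return args

VMS = 144                    # 12 VMs on each of 12 nodes (assumed one run per VM)
THREADS, OUTSTANDING = 2, 32

command = ".\\diskspd.exe " + " ".join(diskspd_args(4, THREADS, OUTSTANDING, 0, 900))
per_vm = THREADS * OUTSTANDING   # IOs in flight per VM
cluster = per_vm * VMS           # IOs in flight across the cluster

print(command)
print(f"{per_vm} outstanding IOs per VM, {cluster} cluster-wide")
```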

StarWind Command Center. Designed as an alternative to Windows Admin Center and bulky System Center Configuration Manager, StarWind Command Center consolidates sophisticated dashboards that provide all the important information about the state of each environment component on a single screen.

As a single-pane-of-glass tool, StarWind Command Center lets you handle the whole range of tasks involved in managing and monitoring your IT infrastructure, applications, and services. As a part of the StarWind ecosystem, StarWind Command Center allows managing a hypervisor (VMware vSphere, Microsoft Hyper-V, Red Hat KVM, etc.) and integrates with Veeam Backup & Replication and public cloud infrastructure. On top of that, the solution incorporates StarWind ProActive Premium Support, which monitors the cluster 24/7, predicts failures, and reacts to them before things go south.

How StarWind Command Center can be integrated into HCI

For example, StarWind Command Center Storage Performance Dashboard features an interactive chart plotting cluster-wide aggregate IOPS measured at the CSV filesystem layer in Windows. More detailed reporting is available in the command-line output of DISKSPD and VM Fleet.

The other side of storage performance is the latency – how long an IO takes to complete. Many storage systems perform better under heavy queuing, which helps to maximize parallelism and busy time at every layer of the stack. But there’s a tradeoff: queuing increases latency. For example, if you can do 100 IOPS with sub-millisecond delay, you may also be able to achieve 200 IOPS if you can tolerate higher latency.

Latency is good to watch out for: sometimes the largest IOPS benchmarking numbers are only possible with latency that would otherwise be unacceptable.
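The queuing tradeoff can be quantified with Little's Law: mean latency = outstanding IOs / IOPS. The sketch below applies it to the parameters of the 4 kB read test. The result is a back-of-the-envelope estimate, not a measured value: it assumes every one of the 144 VMs keeps all 2 x 32 queue slots full for the whole run.

```python
def mean_latency_ms(outstanding_ios, iops):
    """Little's Law: average time an IO spends in the system, in milliseconds."""
    return outstanding_ios / iops * 1000.0

# Assumption: 144 VMs x 2 threads x 32 outstanding IOs, all continuously in flight.
outstanding = 144 * 2 * 32
estimate = mean_latency_ms(outstanding, 6_709_997)
print(f"{outstanding} outstanding IOs imply ~{estimate:.2f} ms mean latency")

# At a fixed IOPS ceiling, doubling the queue depth doubles the implied latency.
doubled = mean_latency_ms(2 * outstanding, 6_709_997)
print(f"doubling the queue depth implies ~{doubled:.2f} ms")
```

This is why a headline IOPS number should always be read together with the latency observed at the queue depth that produced it.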

Cluster-wide aggregate IO latency, as measured at the same layer in Windows, is plotted on the HCI Dashboard too.

Results

Any storage system that provides fault tolerance necessarily makes distributed copies of writes, which must traverse the network and incur backend write amplification. For this reason, the largest IOPS benchmark numbers are typically achieved only with reads, especially if the storage system has common-sense optimizations to read from the local copy whenever possible, which StarWind Virtual SAN does.

NOTE: For transparency, we show the VM Fleet results and the StarWind Command Center results together with videos of those tests.

Action 1: 4K random read --> .\Start-Sweep.ps1 -b 4 -t 2 -o 32 -w 0 -p r -d 900

With 100% reads, the cluster delivered 6,709,997 IOPS. This is 51% of the theoretical value of 13,200,000 IOPS.

Where does the reference value come from? Each node had two Intel Optane NVMe SSDs, each capable of 550,000 IOPS: 2 disks * 12 nodes * 550,000 IOPS/drive = 13,200,000 IOPS.
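The same calculation, together with the resulting efficiency, as a short sketch using only numbers quoted in this article (the variable names are ours):

```python
NODES = 12
DRIVES_PER_NODE = 2
IOPS_PER_DRIVE = 550_000     # per-drive figure quoted in the text
MEASURED_IOPS = 6_709_997    # 4 kB random read result

theoretical = NODES * DRIVES_PER_NODE * IOPS_PER_DRIVE
efficiency_pct = 100 * MEASURED_IOPS / theoretical

print(f"theoretical: {theoretical:,} IOPS")   # theoretical: 13,200,000 IOPS
print(f"achieved:    {efficiency_pct:.1f}%")  # achieved:    50.8%
```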

Action 2: 4K random read/write 90/10 --> .\Start-Sweep.ps1 -b 4 -t 2 -o 32 -w 10 -p r -d 900

With 90% random reads and 10% writes, the cluster delivers 5,139,741 IOPS.

Action 3: 4K random read/write 70/30 --> .\Start-Sweep.ps1 -b 4 -t 2 -o 32 -w 30 -p r -d 900

With 70% random reads and 30% writes, the cluster delivers 3,434,870 IOPS.

NOTE: We had double writes since StarWind Virtual SAN synchronizes each virtual disk. This can be seen in StarWind Command Center.
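The effect of double writes on the backend can be estimated from the front-end numbers. The sketch below is an approximation under stated assumptions, not a reading from StarWind Command Center: each read is served from a single copy, and each write lands on exactly two copies (two-way mirror). The function name is hypothetical.

```python
def backend_iops(frontend_iops, write_pct, copies=2):
    """Estimate backend operations when every write is stored `copies` times."""
    writes = frontend_iops * write_pct // 100
    reads = frontend_iops - writes
    return reads + writes * copies

# Front-end results from the two mixed read/write tests.
mixed_90_10 = backend_iops(5_139_741, 10)
mixed_70_30 = backend_iops(3_434_870, 30)

print(f"90/10: ~{mixed_90_10:,} backend operations per second")
print(f"70/30: ~{mixed_70_30:,} backend operations per second")
```

Under these assumptions, the heavier the write share, the larger the gap between what the VMs see and what the drives and the network actually handle.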

Action 4: 2M sequential read --> .\Start-Sweep.ps1 -b 2048 -t 2 -o 16 -p s -w 0 -d 900

With 100% sequential reads, the cluster utilizes 60.06GBps of network throughput.

Action 5: 2M sequential write --> .\Start-Sweep.ps1 -b 2048 -t 2 -o 16 -p s -w 100 -d 900

With 100% sequential writes, the cluster utilizes 50.81GBps of network throughput.

Because of disk mirroring, StarWind Command Center reports the accurate Disk Throughput of 50.81GBps, which is twice the value returned by VM Fleet.

Here are all the results, with the same 12-server HCI cluster:

Run | Parameters | Result
Maximize IOPS, all-read | 4 kB random, 100% read | 6,709,997 IOPS (1)
Maximize IOPS, read/write | 4 kB random, 90% read, 10% write | 5,139,741 IOPS
Maximize IOPS, read/write | 4 kB random, 70% read, 30% write | 3,434,870 IOPS
Maximize throughput | 2 MB sequential, 100% read | 61.9GBps (2)
Maximize throughput | 2 MB sequential, 100% write | 50.81GBps

(1) 51% of the theoretical 13,200,000 IOPS
(2) 53% of the theoretical 112.5GBps bandwidth

Addition

So far, we have seen that StarWind was not 100% loaded and could deliver more performance. As expected, we saw high latency in the 4K IO block results.

Memory latency vs. Access range

To achieve the maximum performance, we installed 2 more Intel Optane NVMe SSDs and ran 2 additional tests. Here are the results.

Action 6: 2M sequential read --> .\Start-Sweep.ps1 -b 2048 -t 2 -o 16 -p s -w 0 -d 900

With 100% sequential reads, the cluster utilizes nearly all the network throughput: 108.38GBps.

This is 96% of the theoretical 112.5GBps.

Action 7: 2M sequential write --> .\Start-Sweep.ps1 -b 2048 -t 2 -o 16 -p s -w 100 -d 900

With 100% sequential 2M block writes, the cluster throughput is 100.29GBps. These numbers are accurate for the entire cluster, which has continuously synchronized shared storage.

Here are all the results, with the same 12-server HCI cluster:

Run | Parameters | Result
Maximize IOPS, all-read | 4 kB random, 100% read | 6,709,997 IOPS (1)
Maximize IOPS, read/write | 4 kB random, 90% read, 10% write | 5,139,741 IOPS
Maximize IOPS, read/write | 4 kB random, 70% read, 30% write | 3,434,870 IOPS
Maximize throughput | 2 MB sequential, 100% read | 61.9GBps
Maximize throughput | 2 MB sequential, 100% write | 50.81GBps
Maximize throughput +2NVMe SSD | 2 MB sequential, 100% read | 108.38GBps (2)
Maximize throughput +2NVMe SSD | 2 MB sequential, 100% write | 100.29GBps

(1) 51% of the theoretical 13,200,000 IOPS
(2) 96% of the theoretical 112.5GBps bandwidth

Conclusion

These are the first-stage benchmark results for a 12-node production-ready StarWind HCA cluster built on the Supermicro SuperServer platform, with each node powered by Intel® Xeon® Platinum 8268 processors, Intel® Optane™ SSD DC P4800X Series drives, and Mellanox ConnectX-5 100GbE NICs.

The cache-less 12-node HCA cluster delivered 6.7 million IOPS, 51% of the theoretical 13.2 million IOPS. This is breakthrough performance for a pure production configuration (client access used only iSCSI, without RDMA). The backbone ran over iSER, and no proprietary technology was used. Similar performance results can be obtained with any hypervisor using pure iSCSI initiators and StarWind Virtual SAN.

For the next benchmark stage, we are going to max out IO performance by configuring Intel Optane NVMe SSDs as caching devices, just as Intel recommends.