
Stage 3: What is performance like when storage is presented over NVMe-oF?

Now it's time to see how NVMe-oF performs in a 12-node all-NVMe cluster.

Introduction

For the previous HCI benchmark stage, we obtained 26.8 million IOPS (101% of the theoretical 26.4 million IOPS) in a production-ready StarWind HyperConverged Appliance cluster. The test environment comprised 12 Supermicro SuperServers featuring Intel® Xeon® Platinum 8268 processors and Intel® Optane™ SSD DC P4800X Series drives. As in the StarWind HyperConverged Appliances that we ship, this setup used Mellanox ConnectX-5 100GbE NICs connected through Mellanox SN2700 Spectrum™ switches with Mellanox LinkX® copper cables.

This article presents the results of the final benchmark stage, in which all NVMe drives in the 12-node production-ready StarWind HCA cluster were presented over NVMe-oF.

A few words about the setup

The StarWind HCAs used for this test were maxed out: we added 4x Intel® Optane™ SSD DC P4800X Series drives to each node and presented them over NVMe-oF to the entire cluster.

NVMe-oF is a protocol tailored to present flash over RDMA. Our implementation of this protocol, StarWind NVMe-oF, was used in this study.

In our setup, each StarWind HCA ran Microsoft Hyper-V (Windows Server 2019) and the StarWind NVMe-oF Initiator service, the latter in Windows userland. In this scenario, the StarWind NVMe-oF Initiator handled storage interconnection, while SPDK served as the NVMe-oF target.

There were certain limitations in our Enterprise HCI environment. Intel Optane NVMe drives were used as the fastest underlying storage connected to the virtual machines that handle I/O. The storage had no block-level replication, which means there was no "NVMe reservation" functionality. As a result, the highly available cluster configuration relied on data replication at the VM or application level. This is typical for SQL Server AGs, SAP, and other databases, so it works well for DBAs who use their own application-level replication tools. We, in turn, brought NVMe Reservation functionality to StarWind Virtual SAN.

Learn more about hyperconverged infrastructures powered by StarWind Virtual SAN on our website.

12-node StarWind HyperConverged Appliance cluster specifications

Platform: Supermicro SuperServer 2029UZ-TR4+
CPU: 2x Intel® Xeon® Platinum 8268 Processor 2.90 GHz; RAM: 96GB; Intel® Turbo Boost ON, Intel® Hyper-Threading ON
Boot and Storage Capacity: 2x Intel® SSD D3-S4510 Series
Storage Capacity: 4x Intel® Optane™ SSD DC P4800X Series. The latest available firmware installed
Networking: 2x Mellanox ConnectX-5 MCX516A-CCAT 100GbE Dual-Port NIC
Switch: 2x Mellanox SN2700 Spectrum 32 ports 100GbE Ethernet Switch

The diagram below illustrates servers’ interconnection.


Interconnection diagram

NOTE: On every server, each NUMA node had 2x Intel® Optane™ SSD DC P4800X Series drives and 1x Mellanox 100 GbE NIC connected to it in order to squeeze maximum performance out of each piece of hardware. This layout is a recommendation rather than a strict requirement. To obtain similar performance, no NUMA node tweaking is required; the default settings are fine.

Software Setup

Operating system. Windows Server 2019 Datacenter Evaluation version 1809, build 17763.404, was installed on all nodes with the latest updates available on June 11, 2019. With performance in mind, the power plan was set to High Performance. All other settings, including the relevant side-channel mitigations (mitigations for Spectre v1 and Meltdown were applied), were left at their defaults.

Windows Installation. We installed the Hyper-V role and configured MPIO and the Failover Cluster feature on each node.
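The Hyper-V role, MPIO, and Failover Clustering can be installed in one go from an elevated PowerShell prompt. This is a sketch of the standard cmdlet, not the exact commands used in the test setup:

```powershell
# Install Hyper-V, MPIO, and Failover Clustering on the current node, then reboot.
Install-WindowsFeature -Name Hyper-V, Multipath-IO, Failover-Clustering `
    -IncludeManagementTools -Restart
```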


Adding roles


Driver installation. Firmware update. Once Windows was installed, drivers were updated for each piece of hardware through Windows Update, and firmware updates for the Intel NVMe SSDs were installed.

SPDK target VMs. In our cluster, we deployed 24 Linux VMs running the SPDK NVMe-oF target: 2 target VMs per server, one per NUMA node. 2x NVMe SSDs and 1 NIC port were passed through to each target VM.

Each target VM had 4 vCPUs, 6GB RAM, and 10GB storage. CentOS 7.6 was installed on the target VMs along with SPDK 19.01.1 and DPDK 18.11.0.

Here's the nvmf.conf file listing
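SPDK 19.01 still supported a legacy INI-style configuration file. The listing below is an illustrative sketch only, not the file from the test setup: the core mask, listen addresses, NQNs, serial numbers, and NVMe PCI addresses are all assumptions to be replaced with real values.

```ini
[Global]
  ReactorMask 0x5            # cores 0 and 2, matching ./nvmf_tgt -m [0,2]

[Nvme]
  # Attach the two passed-through Optane drives (PCI addresses are assumptions)
  TransportId "trtype:PCIe traddr:0000:5e:00.0" Nvme0
  TransportId "trtype:PCIe traddr:0000:5f:00.0" Nvme1

[Subsystem1]
  NQN nqn.2019-06.com.example:optane1
  Listen RDMA 172.16.10.1:4420
  AllowAnyHost Yes
  SN SPDK00000000000001
  Namespace Nvme0n1 1

[Subsystem2]
  NQN nqn.2019-06.com.example:optane2
  Listen RDMA 172.16.10.1:4421
  AllowAnyHost Yes
  SN SPDK00000000000002
  Namespace Nvme1n1 1
```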



Targets are created by executing the command: ./nvmf_tgt -m [0,2] -c ./nvmf.conf (the -m option is the core mask; here the target runs on cores 0 and 2).

Hyper-V VMs. To benchmark cluster performance, we populated the cluster with 576 Windows Server 2019 Standard Gen 2 VMs, 48 VMs per node. Each VM had 1 virtual processor and 1GiB of memory, so 48 cores of each server were utilized. Each VM had 25GB of storage.


benchmark cluster performance


NOTE: According to the NUMA spanning recommendations, NUMA spanning was disabled to ensure that virtual machines always ran with optimal performance.

RDMA. Mellanox ConnectX-5 NICs incorporate Resilient RoCE to provide the best-of-breed performance with only simple enablement of Explicit Congestion Notification (ECN) on the network switches. Lossless fabric, which is usually achieved through enablement of PFC, is not mandated anymore. The Resilient RoCE congestion management implemented in ConnectX NIC hardware delivers reliability even with UDP over a lossy network. Mellanox Spectrum Ethernet switches provide 100GbE line-rate performance and consistent low latency with zero packet loss. Additionally, Spectrum makes it easy to configure RoCE and has end-to-end flow level visibility.

StarWind NVMe-oF Initiator. NVMe-oF, or NVMe over Fabrics, is a network protocol, like iSCSI, used to communicate between a host and a storage system over a network (aka fabric) utilizing RDMA. This is an emerging technology that gives data centers unprecedented access to NVMe SSD storage. We developed our own NVMe-oF Initiator for Windows using the Storport miniport driver model.

In this setup, the StarWind NVMe-oF Initiator handled storage interconnection, and SPDK was used as the target. The initiator was written from scratch, since the Windows kernel has no SPDK equivalent and there are no polling drivers for NICs or cache. It's worth mentioning that we managed to get close to those "dramatically low-latency" results seen on Linux. We believe that, to catch up with Linux, Microsoft will either decide to rewrite parts of the kernel at some point or simply allow other vendors to develop applications that match Linux storage performance.

NUMA node. Taking the NUMA configuration of each cluster node into account, each NVMe SSD was configured to use a network adapter assigned to the same NUMA node. For example, on cluster node 3, an Intel Optane SSD was paired with the Mellanox NIC located on NUMA node 1. Each cluster node was connected to the target VMs in a grid, as illustrated in the diagrams below.


NUMA node assignment diagram for RAW device tests

NUMA node assignment diagram for VM-based tests

CPU Groups. CPU Groups allow Hyper-V administrators to manage and distribute host CPU resources across guest virtual machines. They make it easy to isolate VMs belonging to different CPU groups (i.e., NUMA nodes) from one another so that NICs and NVMe SSDs can be assigned to specific VMs.

Hyper-V scheduler. In Windows Server 2019, the default Hyper-V scheduler is the core scheduler. The core scheduler offers a strong security boundary for guest workload isolation but comes at a performance cost.
We switched to the classic scheduler, which was the default in all earlier versions of Hyper-V.
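The scheduler type is controlled through the hypervisor's boot configuration; switching to the classic scheduler takes one bcdedit command and a host reboot:

```
:: Run from an elevated command prompt on each host, then reboot.
bcdedit /set hypervisorschedulertype classic
```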

Hyper-V Virtual Machine Bus (VMBus) Multi-Channel. Hyper-V VMBus is one of the mechanisms used by Hyper-V to offer paravirtualization. In short, it is a virtual bus device that sets up channels between the guest and the host. These channels provide the capability to share data between partitions and setup synthetic devices.

In our scheme, each VM had 1 vCPU to optimize the I/O workload and eliminate waiting time, giving a 1:1 mapping of VMBus channels to vCPUs.


Communication in partitions via Hyper-V VMBus

Benchmarking

In virtualization and hyperconverged infrastructures, it's common to judge solution performance by the number of storage input/output (I/O) operations per second, or "IOPS" – essentially, the number of reads or writes that virtual machines can perform. A single VM can generate a huge number of random or sequential reads/writes, and in real production environments there are tons of VMs, which makes the I/O fully randomized. Since Hyper-V virtual machines perform 4 kB block-aligned I/O, that block size was our benchmark of choice.

Hardware and software vendors often use this kind of pattern to measure the best performance in the worst circumstances.

In this article, we not only performed the same benchmarks as Microsoft while measuring Storage Spaces Direct performance but also carried out additional tests for other I/O patterns that are commonly used in virtualization production environments.

For this benchmark stage, we measured performance both for RAW storage and inside VMs. The RAW device configuration was tested with VM Fleet and FIO. VM-based I/O benchmarks were run using VM Fleet and DISKSPD. In addition to VM Fleet, StarWind Command Center was used to monitor performance.

FIO is a tool that spawns a number of threads or processes doing a particular type of I/O action as specified by the user. The typical use of FIO is to write a job file matching the I/O load one wants to simulate.

In order to benchmark RAW device performance, we connected each cluster node to a target VM using the StarWind NVMe-oF Initiator. The client was the bare-metal host; the server was a target VM running SPDK.

FIO configuration file options for testing RAW device performance

  • ioengine=windowsaio – defines how the job issues I/O; we use Windows native asynchronous I/O.
  • blocksize=4k, 2048k – Block size for I/O units. Default: 4k. Values for reads and writes can be specified separately in the format read, write, either of which may be empty to leave that value at its default.
  • rw=randrw, readwrite, write - a type of I/O pattern.
  • rwmixread=90, 70 – the percentage of reads in a mixed workload.
  • thread – use threads created with pthread_create instead of processes created with fork.
  • direct=1 – use non-buffered I/O.
  • buffered=0 – this is the opposite of the direct parameter. By default, it is set to true.
  • time_based – run for the specified runtime duration even if the files are completely read or written. The same workload will be repeated as many times as runtime allows.
  • timeout=900 – runtime duration.
  • group_reporting – display per-group reports instead of per-job when the numjobs parameter is specified.
  • iodepth=8 – Number of I/O units to keep in flight against the file.
  • numjobs=8 – Number of clones (processes/threads performing the same workload) of this job.
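Put together, the options above correspond to a job file along these lines. This is a sketch only: the job name and the target device path are assumptions, and the device should point at the disk connected via the NVMe-oF initiator.

```ini
; 4 kB random read/write 90/10 against a raw device (illustrative sketch)
[global]
ioengine=windowsaio   ; Windows native asynchronous I/O
thread                ; use threads instead of forked processes
direct=1              ; non-buffered I/O
group_reporting       ; aggregate stats across numjobs clones
time_based
runtime=900           ; 900-second run
iodepth=8             ; 8 outstanding I/Os per job
numjobs=8             ; 8 clones of the job

[rand-rw-90-10]
filename=\\.\PhysicalDrive1   ; assumed path of the NVMe-oF-connected device
blocksize=4k
rw=randrw
rwmixread=90
```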

VM Fleet. We used the open-source VM Fleet tool available on GitHub. VM Fleet makes it easy to orchestrate DISKSPD or FIO (Flexible I/O tester), popular micro-benchmark tools, across hundreds or thousands of Hyper-V virtual machines at once.

VM Fleet configuration for VM-based benchmarks. To saturate performance, we set 1 thread per file (-t1). Following Intel Optane recommendations, this thread count was used across the storage I/O tests. The highest storage performance was reached at the saturation point with 16 outstanding I/Os per thread (-o16). To disable hardware and software caching, we set unbuffered I/O (-Sh). We specified -r for random workloads and -b4K for the 4 kB block size; the read/write proportion was altered with the -w parameter.

Here's how DISKSPD was started: .\diskspd.exe -b4K -t1 -o16 -w0 -Sh -L -r -d900 [...]

NOTE: We modified the VM Fleet scripts to benchmark storage accurately. By default, VM Fleet reads I/O performance from CSVFS. Considering that storage was presented over NVMe-oF, we set up VM Fleet to get performance from local storage.

StarWind Command Center. Designed as an alternative to the featureless Windows Admin Center and the bulky System Center Configuration Manager for routine tasks, StarWind Command Center consolidates sophisticated dashboards that show all the important information about the state of each environment component on a single screen.

Being a single-pane-of-glass tool, StarWind Command Center solves a whole range of tasks related to managing and monitoring your IT infrastructure, applications, and services. As part of the StarWind ecosystem, it manages hypervisors (VMware vSphere, Microsoft Hyper-V, Red Hat KVM, etc.) and integrates with Veeam Backup & Replication and public cloud infrastructure. On top of that, the solution incorporates StarWind ProActive Support, which monitors the cluster 24/7, predicts failures, and reacts to them before things go south.


How StarWind Command Center can be integrated into HCI


For instance, StarWind Command Center Storage Performance Dashboard features an interactive chart of cluster-wide aggregate IOPS measured at the CSV layer in Windows. More detailed reporting is available in the command-line output of DISKSPD and VM Fleet.



The other side of storage performance is latency: how long an I/O takes to complete. Many storage systems perform better under heavy queuing, which helps maximize parallelism and busy time at every layer of the stack. But there's a tradeoff: queuing increases latency. For example, if you can do 100 IOPS at sub-millisecond latency, you may also be able to achieve 200 IOPS if you can tolerate higher latency. Latency is worth watching: sometimes the largest IOPS benchmarking numbers are only possible with delays that would otherwise be unacceptable.

Cluster-wide aggregate IO latency, as measured at the same layer in Windows, is plotted on the HCI Dashboard too.


Results

Any storage system that provides fault tolerance makes distributed copies of writes, which must traverse the network and incurs backend write amplification. For this reason, the largest IOPS benchmark numbers are typically associated with reads, especially if the storage system has common-sense optimizations to read from the local copy whenever possible, which StarWind Virtual SAN does.

NOTE: For completeness, we present both the VM Fleet results and the StarWind Command Center results.

Action 1: RAW device 4K random read

With 100% reads, the cluster delivered 22,239,158 IOPS, or 84% of the theoretical 26,400,000 IOPS (each node had four Intel Optane NVMe SSDs, each rated at 550,000 IOPS).

Action 2: RAW device 4K random read/write 90/10
With 90% random reads and 10% writes, the cluster delivered 21,923,445 IOPS.
Action 3: RAW device 4K random read/write 70/30
With 70% random reads and 30% writes, the cluster delivered 21,906,429 IOPS.
Action 4: RAW device 2M sequential read
With 2M sequential reads, the cluster fully utilized the network (network throughput was 124.72GBps).
This makes 110.8% of the theoretical 112.5GBps.
Action 5: RAW device 2M sequential write
With 2M sequential writes, the cluster completely utilized the network (106.49GBps).
Action 6: VM-based 4K random read --> .\Start-Sweep.ps1 -p r -b 4 -t 1 -o 16 -w 0 -d 900
With 100% reads, the VMs performed 20,187,670 IOPS. This was 76% performance out of theoretical 26,400,000 IOPS. Every node had 4 Intel Optane NVMe SSDs, each performed 550,000 IOPS.
Action 7: VM-based 4K random read/write 90/10 --> .\Start-Sweep.ps1 -b 4 -t 1 -o 16 -w 10 -d 900 -p r
With 90% random reads and 10% writes, the cluster delivered 19,882,005 IOPS.
Action 8: VM-based 4K random read/write 70/30 --> .\Start-Sweep.ps1 -b 4 -t 1 -o 16 -w 30 -d 900 -p r
With 70% random reads and 30% writes, the cluster delivered 19,229,996 IOPS.
Action 9: VM-based 2M sequential read --> .\Start-Sweep.ps1 -p s -b 2048 -t 1 -o 4 -w 0 -d 900
With 2M sequential reads, the cluster completely utilized the network (network throughput was 124.55GBps).
This was 110.7% of the theoretical 112.5GBps.
Action 10: VM-based 2M sequential write --> .\Start-Sweep.ps1 -p s -b 2048 -t 1 -o 4 -w 100 -d 900
With 2M sequential writes, the cluster completely utilized the network (throughput reached 107.48GBps).

Here are all the results obtained for the same 12-server HCI cluster.

Run | Parameters | Result
RAW device, Maximize IOPS, all-read | 4 kB random, 100% read | 22,239,158 IOPS (1)
RAW device, Maximize IOPS, read/write | 4 kB random, 90% read, 10% write | 21,923,445 IOPS
RAW device, Maximize IOPS, read/write | 4 kB random, 70% read, 30% write | 21,906,429 IOPS
RAW device, Maximize throughput | 2 MB sequential, 100% read | 124.72GBps (2)
RAW device, Maximize throughput | 2 MB sequential, 100% write | 106.49GBps
VM-based, Maximize IOPS, all-read | 4 kB random, 100% read | 20,187,670 IOPS (3)
VM-based, Maximize IOPS, read/write | 4 kB random, 90% read, 10% write | 19,882,005 IOPS
VM-based, Maximize IOPS, read/write | 4 kB random, 70% read, 30% write | 19,229,996 IOPS
VM-based, Maximize throughput | 2 MB sequential, 100% read | 124.55GBps (4)
VM-based, Maximize throughput | 2 MB sequential, 100% write | 107.48GBps

(1) 84% of the theoretical 26,400,000 IOPS
(2) 110.8% of the theoretical 112.5GBps
(3) 76% of the theoretical 26,400,000 IOPS
(4) 110.7% of the theoretical 112.5GBps
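The footnoted percentages follow from quick arithmetic on the cluster specs (12 nodes x 4 drives x 550,000 IOPS per drive):

```python
# Recompute the percentages quoted in the footnotes above.
theoretical_iops = 12 * 4 * 550_000       # 26,400,000 IOPS cluster-wide
theoretical_gbps = 112.5                  # theoretical sequential throughput

print(round(22_239_158 / theoretical_iops * 100))  # RAW 4K read -> 84
print(round(20_187_670 / theoretical_iops * 100))  # VM 4K read  -> 76
print(round(124.55 / theoretical_gbps * 100, 1))   # VM 2M read  -> 110.7
# The RAW 2M read figure (124.72 / 112.5 = 110.86%) is quoted as 110.8%,
# i.e., truncated rather than rounded.
```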

Conclusion

This is pretty much it for the third benchmark stage of the 12-node production-ready StarWind HCA cluster built on the Supermicro SuperServer platform. The 12-node all-NVMe cluster delivered 22.24 million IOPS on raw devices (84% of the theoretical 26.4 million IOPS) and 20.19 million IOPS inside VMs (76%).

This breakthrough performance comes from configuring the NVMe-oF cluster correctly: the fastest NVMe storage is passed through to the SPDK NVMe-oF target VMs, while the StarWind NVMe-oF Initiator handles storage interconnection.
