Data Locality

Stage 2: iSCSI shared storage with write-back cache enabled

How does write-back caching affect all-NVMe storage performance? Let’s find out.

Introduction

For the first HCI benchmark stage, we obtained 6.7 million IOPS, 51% of the theoretical 13.2 million IOPS, on a production-ready StarWind HCA cluster. The test environment consisted of 12 Supermicro SuperServers powered by Intel® Xeon® Platinum 8268 processors, Intel® Optane™ SSD DC P4800X Series drives, and Mellanox ConnectX-5 100GbE NICs, all connected by Mellanox SN2700 Spectrum™ switches and Mellanox LinkX® copper cables. This is a standard hardware configuration for our production HCAs; only the CPUs were upgraded in pursuit of the HCI industry record.

In this article, we are going to see how write-back caching changes the game.


Changes in the environment

We doubled the power of the cluster used in the previous stage: each node received 4x Intel® Optane™ SSD DC P4800X Series drives, configured as write-back cache according to Intel recommendations.

There are some limitations to this Enterprise HCI scenario. Each Intel Optane NVMe drive was used as a dedicated cache device for a specific IO-handling virtual machine, so the VMs in our test environment could not be migrated. In other words, this scenario should not be treated as a general-purpose Enterprise case. The highly-available cluster configuration relied on data replication at the VM or application level, which is typical for SQL Server Availability Groups, SAP, and other databases, so it suits DBAs who use their own application-level replication tools.

In our setup, StarWind HCA runs the fastest software stack we have: Microsoft Hyper-V (Windows Server 2019) and StarWind Virtual SAN. The latter runs in Windows userland and supports polling in addition to interrupt-driven IO, boosting performance by turning CPU cycles into IOPS. We also developed TCP loopback and iSCSI loopback drivers that fix the load-balancing issues of the aging Microsoft iSCSI Initiator.

On our website, you can learn more about hyperconverged infrastructures powered by StarWind Virtual SAN.


12-node StarWind HyperConverged Appliance cluster specifications

Platform: Supermicro SuperServer 2029UZ-TR4+
CPU: 2x Intel® Xeon® Platinum 8268 Processor 2.90 GHz (Intel® Turbo Boost ON, Intel® Hyper-Threading ON)
RAM: 96 GB
Boot and Storage Capacity: 2x Intel® SSD D3-S4510 Series
Write-Back Cache Capacity: 4x Intel® Optane™ SSD DC P4800X Series (latest available firmware installed)
Networking: 2x Mellanox ConnectX-5 MCX516A-CCAT 100GbE Dual-Port NIC
Switch: 2x Mellanox SN2700 Spectrum 32-port 100GbE Ethernet Switch

The diagram below illustrates servers’ interconnection.

interconnection-diagram.png
Interconnection diagram

NOTE: On every server, each NUMA node had 2x Intel® Optane™ SSD DC P4800X Series drives and 1x Mellanox ConnectX-5 100 GbE NIC attached in order to squeeze maximum performance out of each piece of hardware. This layout is a recommendation rather than a strict requirement: similar performance can be obtained without any NUMA tweaking, meaning the default settings are fine.
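If you want to verify which NUMA node each NIC is attached to, the Get-NetAdapterHardwareInfo cmdlet exposes this directly. This is just a verification sketch, not a step from our deployment:

# Show the NUMA node each physical NIC is attached to
Get-NetAdapterHardwareInfo | Format-Table Name, InterfaceDescription, NumaNode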

Software Setup

Operating system. Windows Server 2019 Datacenter Evaluation, version 1809, build 17763.404, was installed on all nodes together with the latest updates available on May 1, 2019. With performance in mind, the power plan was set to High Performance. All other settings, including the relevant side-channel mitigations (mitigations for Spectre v1 and Meltdown were applied), were left at their defaults.
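For reference, the High Performance power plan can be activated from an elevated prompt; SCHEME_MIN is the built-in alias for that plan:

powercfg /setactive SCHEME_MIN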

Windows installation. To speed up deployment, we prepared an image of Windows Server 2019 with the Hyper-V role installed and the MPIO and Failover Clustering features enabled. That image was then deployed to all 12 Supermicro servers.
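On a fresh installation, the same roles and features can be enabled with a single PowerShell line (a minimal sketch using the standard Windows Server feature names):

# Install Hyper-V, MPIO, and Failover Clustering, then reboot
Install-WindowsFeature -Name Hyper-V, Multipath-IO, Failover-Clustering -IncludeManagementTools -Restart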

iscsi-shared-storage-with-wb-cache-enabled-1.png

Driver installation and firmware updates. Once Windows was installed, the latest drivers were applied to each piece of hardware via Windows Update. Firmware updates for the Intel NVMe SSDs were installed too.

StarWind Virtual SAN. The current production-ready StarWind VSAN version (8.0.0.12996) was installed on each server. Microsoft recommends creating at least one Cluster Shared Volume per server node, so for 12 servers we created 12 volumes with ReFS, which delivered noticeably better performance than NTFS in our tests. Each volume was 110 GiB, for roughly 1.3 TiB of total usable storage capacity. Each volume used two-way mirror resiliency with allocation delimited to two servers. All other settings, such as columns and interleave, were left at their defaults. To accurately measure persistent storage IOPS, we disabled the in-memory CSV read cache.
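For illustration, here is a minimal sketch of the volume formatting and cache-disable steps, assuming the StarWind LUN is already presented to Windows; the disk number and label are hypothetical:

# Format a StarWind LUN with ReFS (disk number and label are illustrative)
New-Volume -DiskNumber 5 -FriendlyName "CSV01" -FileSystem ReFS

# Disable the in-memory CSV read cache cluster-wide (size is in MB; 0 = off)
(Get-Cluster).BlockCacheSize = 0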

iscsi-shared-storage-with-wb-cache-enabled-2

StarWind iSCSI Accelerator. We used the built-in Microsoft iSCSI Initiator together with our own user-mode iSCSI initiator. Microsoft iSCSI Initiator was developed in “the stone age”, when servers had one or two CPU sockets with a single core each. On today’s far more powerful servers, the Initiator no longer performs as well as it should.

That’s why we developed iSCSI Accelerator, a filter driver that sits between Microsoft iSCSI Initiator and the hardware presented over the network. Every time a new iSCSI session is created, it is assigned to a free CPU core, so all CPU cores are utilized uniformly and latency stays minimal. Distributing workloads this way ensures smart compute resource utilization: no cores are overwhelmed while others idle.

CPU Core load diagram
CPU Core load diagram

StarWind iSCSI Accelerator (Load Balancer) was installed on each node in order to balance virtualized workloads across all CPU cores of the Hyper-V servers.

StarWind Loopback Accelerator. As part of StarWind Virtual SAN, StarWind Loopback Accelerator was installed and configured to significantly decrease latency and CPU load for cases where Microsoft iSCSI Initiator connects to StarWind iSCSI Target over loopback. It enables zero-copy memory transfers in loopback mode, bypassing most of the TCP stack.

NOTE: Thanks to the fast path provided by StarWind Loopback Accelerator, each iSCSI LUN had 2 loopback iSCSI sessions and 2 iSCSI sessions to external partners. The Least Queue Depth (LQD) MPIO policy was set. This policy maximizes network bandwidth utilization by automatically using the active/optimized path with the smallest current outstanding queue.
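For reference, LQD can be set as the global default for new MPIO disks via the MPIO PowerShell module; this is a sketch, and a deployment may equally set the policy per device:

# Make Least Queue Depth the default MPIO load-balance policy
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy LQD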

iSCSI sessions interconnection diagram
iSCSI sessions interconnection diagram

Block iSCSI / iSER (RDMA). Like any environment built from StarWind HyperConverged Appliances, today’s 12-node HCI cluster featured Mellanox NICs and switches. StarWind Virtual SAN uses iSER (iSCSI Extensions for RDMA) for its backbone links, delivering the maximum possible performance.

NOTE: Windows Server 2019 doesn’t have iSER (RDMA) support yet. The lack of all-RDMA connections puts pressure on memory and limits performance. To eliminate the overhead of the local Windows TCP/IP stack, StarWind’s built-in userland iSER initiator was used for data and metadata synchronization and for acknowledging “guest” writes over iSER (RDMA).

accelerating-io-performance.png
Accelerating IO Performance

As a result, the IO performance was accelerated with a combination of RDMA, DMA in loopback, and TCP connections.

NUMA node. Taking the NUMA configuration of each cluster node into account, every virtual disk was configured to replicate shared storage between two servers using a network adapter located on the same NUMA node as its storage. For example, on cluster node 3, the virtual disk was created on Intel Optane SSDs located on NUMA node 1, so the Mellanox ConnectX-5 100 GbE NIC used for disk mirroring was also assigned to NUMA node 1.

NUMA node assignment diagram
NUMA node assignment diagram

CSV. For the 12-node hyperconverged cluster, 12 Cluster Shared Volumes were created on top of 12 StarWind synchronously-mirrored virtual disks, in line with Microsoft recommendations.
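A sketch of this step, assuming the clustered disk resources already exist; the resource names are illustrative:

# Promote each clustered StarWind disk to a Cluster Shared Volume
1..12 | ForEach-Object { Add-ClusterSharedVolume -Name "Cluster Disk $_" }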

iscsi-shared-storage-with-wb-cache-enabled-4

Hyper-V VMs. To benchmark cluster performance, we populated the cluster with 48 Windows Server 2019 Standard Gen 2 VMs, 4 VMs per node. Each VM had 24 virtual processors and 8 GiB of memory, so all 96 logical processors of each server were utilized.
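A minimal sketch of the per-node VM layout; the VM names and CSV path below are hypothetical:

# Create 4 Gen 2 VMs with 24 vCPUs and 8 GiB RAM each (names/paths illustrative)
1..4 | ForEach-Object {
    New-VM -Name "vm$_" -Generation 2 -MemoryStartupBytes 8GB -Path "C:\ClusterStorage\Volume1" | Out-Null
    Set-VMProcessor -VMName "vm$_" -Count 24
}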

Write-Back Cache. The Intel Optane NVMe SSDs were configured as write-back caching devices for the VMs, with each NVMe drive assigned to a specific VM.
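The article doesn’t spell out the exact attach mechanism; one common way to dedicate a physical NVMe drive to a single VM is Hyper-V pass-through, sketched below with illustrative disk and VM names (the disk must be offline on the host first):

# Take the Optane drive offline on the host, then pass it through to the VM
Set-Disk -Number 4 -IsOffline $true
Add-VMHardDiskDrive -VMName "vm1" -ControllerType SCSI -DiskNumber 4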

iscsi-shared-storage-with-wb-cache-enabled-5

NOTE: NUMA spanning was disabled to ensure that virtual machines always ran with optimal performance, in line with established NUMA best practices.
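NUMA spanning is a host-level Hyper-V setting; disabling it looks like this (VMs must be restarted for placement to change):

# Keep each VM's memory and vCPUs within a single NUMA node
Set-VMHost -NumaSpanningEnabled $false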

Benchmarking

In virtualization and hyperconverged infrastructures, it’s common to judge solution performance by the number of storage input/output (I/O) operations per second, or “IOPS”: essentially, the number of reads or writes that virtual machines can perform. A single VM can generate a huge number of either random or sequential reads and writes, and in real production environments the sheer number of VMs makes the IO fully randomized. Since 4 kB block-aligned IO is the typical pattern Hyper-V virtual machines produce, it was our benchmark of choice.

Hardware and software vendors often use this kind of pattern to measure the best performance in the worst circumstances.

In this article, we not only ran the same benchmarks Microsoft used to measure Storage Spaces Direct performance but also carried out additional tests for other IO patterns commonly used in virtualization production environments.

VM Fleet. We used the open-source VM Fleet tool available on GitHub. VM Fleet makes it easy to orchestrate DISKSPD, the popular Windows micro-benchmark tool, across hundreds or thousands of Hyper-V virtual machines at once.

Let’s take a closer look at the DISKSPD settings. To saturate performance, we set 16 threads per file (-t16); following Intel Optane recommendations, this thread count was used across numerous storage IO tests. Peak storage performance was reached at the saturation point of 32 outstanding IOs per thread (-o32). To disable hardware and software caching, we set unbuffered IO (-Sh). We specified -r for random workloads and -b4K for a 4 kB block size, and altered the read/write proportion with the -w parameter.

Here’s how DISKSPD was started: .\diskspd.exe -b4K -t16 -o32 -w0 -Sh -L -r -d900 [...]

NOTE: All writes land in the cache first. That’s why we modified the VM Fleet scripts for benchmarking: we wanted VM Fleet to measure read IO performance from both the cache and CSVFS.

StarWind Command Center. Designed to replace the featureless Windows Admin Center and the bulky System Center Configuration Manager for routine tasks, StarWind Command Center consolidates sophisticated dashboards that present all the important information about the state of each environment component on a single screen.

Being a single-pane-of-glass tool, StarWind Command Center solves the whole range of tasks related to managing and monitoring your IT infrastructure, applications, and services. As part of the StarWind ecosystem, it manages hypervisors (VMware vSphere, Microsoft Hyper-V, Red Hat KVM, etc.) and integrates with Veeam Backup & Replication and public cloud infrastructure. On top of that, the solution incorporates StarWind ProActive Support, which monitors the cluster 24/7, predicts failures, and reacts to them before things go south.

iscsi-shared-storage-with-wb-cache-enabled-6
How StarWind Command Center can be integrated into HCI

For instance, StarWind Command Center Storage Performance Dashboard features an interactive chart of cluster-wide aggregate IOPS measured at the CSV filesystem layer in Windows. More detailed reporting is available in the command-line output of DISKSPD and VM Fleet.

iscsi-shared-storage-with-wb-cache-enabled-7

The other side of storage performance is latency: how long an IO takes to complete. Many storage systems perform better under heavy queuing, which helps maximize parallelism and busy time at every layer of the stack. But there’s a tradeoff: queuing increases latency. For example, if you can do 100 IOPS with sub-millisecond latency, you may also be able to achieve 200 IOPS if you can tolerate higher latency. Latency is worth watching out for: sometimes the largest IOPS benchmark numbers are only achievable with delays that would otherwise be unacceptable.

Cluster-wide aggregate IO latency, as measured at the same layer in Windows, is plotted on the HCI Dashboard too.
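As a rough cross-check, Little’s Law ties the two together: outstanding IO = IOPS x average latency. Our read test keeps 48 VMs x 16 threads x 32 outstanding IOs = 24,576 IOs in flight, so at 26.8 million IOPS the implied average completion time is about 24,576 / 26,834,060 ≈ 0.9 ms.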

Results

Any storage system that provides fault tolerance necessarily makes distributed copies of writes, which must traverse the network and incur backend write amplification. For this reason, the largest IOPS benchmark numbers are typically achieved with reads, especially if the storage system has common-sense optimizations to read from the local copy whenever possible, which StarWind Virtual SAN does.

NOTE: For completeness, we show the VM Fleet results and the StarWind Command Center results together with videos of those tests.

Action 1: 4K random read --> .\Start-Sweep.ps1 -b 4 -t 16 -o 32 -w 0 -d 1500 -p r

With 100% reads, the cluster delivered 26,834,060 IOPS, 101.5% of the theoretical 26,400,000 IOPS (every node had four Intel Optane NVMe SSDs, each rated at 550,000 IOPS: 12 x 4 x 550,000 = 26,400,000).

iscsi-shared-storage-with-wb-cache-enabled-8
iscsi-shared-storage-with-wb-cache-enabled-9

Action 2: 4K random read/write 90/10 --> .\Start-Sweep.ps1 -b 4 -t 16 -o 32 -w 10 -d 1500 -p r

With 90% random reads and 10% writes, the cluster delivered 25,840,684 IOPS.

iscsi-shared-storage-with-wb-cache-enabled-10
iscsi-shared-storage-with-wb-cache-enabled-11

Action 3: 4K random read/write 70/30 --> .\Start-Sweep.ps1 -b 4 -t 16 -o 32 -w 30 -d 1500 -p r

With 70% random reads and 30% writes, the cluster delivered 16,034,494 IOPS.

NOTE: We observed double writes because StarWind Virtual SAN synchronously mirrors each virtual disk. This can be clearly seen in StarWind Command Center.

iscsi-shared-storage-with-wb-cache-enabled-12
iscsi-shared-storage-with-wb-cache-enabled-13

Action 4: 2M sequential read --> .\Start-Sweep.ps1 -b 2048 -t 4 -o 8 -w 0 -d 900 -p s

With 2M sequential reads, the cluster fully utilized the network (network throughput was 124.58 GBps).

Actually, this was 110.7% of the theoretical 112.5 GBps.

iscsi-shared-storage-with-wb-cache-enabled-14
iscsi-shared-storage-with-wb-cache-enabled-15

Action 5: 2M sequential write --> .\Start-Sweep.ps1 -b 2048 -t 4 -o 8 -w 100 -d 900 -p s

With 2M sequential writes, the cluster utilized the network almost completely (network throughput was 106.65 GBps).

Actually, this was 94.8% of the theoretical 112.5 GBps.

iscsi-shared-storage-with-wb-cache-enabled-16
iscsi-shared-storage-with-wb-cache-enabled-17

Here are all the results for 12-server HCI cluster performance:

Run                        Parameters                          Result
Maximize IOPS, all-read    4 kB random, 100% read              26,834,060 IOPS (1)
Maximize IOPS, read/write  4 kB random, 90% read, 10% write    25,840,684 IOPS
Maximize IOPS, read/write  4 kB random, 70% read, 30% write    16,034,494 IOPS
Maximize throughput        2 MB sequential, 100% read          124.58 GBps (2)
Maximize throughput        2 MB sequential, 100% write         106.65 GBps

(1) 101.5% of theoretical 26,400,000 IOPS
(2) 110.7% of theoretical 112.5 GBps

Conclusion

This article presents the results of the second benchmarking stage. It highlights the performance of a 12-node production-ready StarWind HCA cluster in which each server was powered by Intel® Xeon® Platinum 8268 processors, Intel® Optane™ SSD DC P4800X Series drives serving as dedicated cache, and Mellanox ConnectX-5 100 GbE NICs.

The 12-node all-NVMe cluster delivered 26.834 million IOPS, 101.5% of the theoretical 26.4 million IOPS.

This is truly breakthrough performance for a production configuration, made possible by the write-back cache, plain iSCSI (without RDMA) for client access, and backbone connections running over iSER. No proprietary technologies were required in our environment, meaning that similar performance can be obtained with any hypervisor using iSCSI initiators, StarWind Virtual SAN, and Intel Optane drives configured as caching devices.



