When vSphere performance tanks, you don’t need a lecture on the complexity of virtualization. You need to know if the problem is the storage array, a noisy neighbor VM, or a physical resource bottleneck.
Our updated guide skips the “best practices” fluff and focuses on the metrics that actually matter.
1. The “Ghost” Bottlenecks (Check These First)
Before diving into deep metrics, rule out the configuration issues that masquerade as performance faults. These are the most common “silent killers” in vSphere environments.
VMware Tools Compliance
It is not optional. A VM without current Tools lacks up-to-date paravirtualized network and disk drivers (VMXNET3, PVSCSI) and may be left on slower emulated devices, effectively running with one hand tied behind its back.
- The Fix: Don’t check them one by one. Use PowerCLI to find the offenders immediately.
- Command:
Get-VM | Where-Object {$_.ExtensionData.Guest.ToolsStatus -ne "toolsOk"} | Select Name, @{N="Tools Status";E={$_.ExtensionData.Guest.ToolsStatus}}
Power Management Policies
By default, ESXi often runs in “Balanced” mode. This saves power by aggressively down-clocking the CPU, which introduces latency (“wake-up time”) for bursty workloads.
- The Fix: Set the Host BIOS and the ESXi Power Management policy to High Performance. The electricity bill might go up slightly, but the random latency spikes will disappear.
NUMA and “Hot Add”
Enabling “CPU Hot Add” on a VM disables vNUMA for that guest. This means the OS has no idea which memory is local to its CPU, leading to slow remote memory access. Unless you plan to add CPUs to a live VM daily, disable Hot Add on all performance-critical workloads.
2. CPU: It’s Usually Co-Stop, Not Lack of GHz
High CPU usage on a host (>75%) is obvious, but the real killer is often Contention, not capacity. You need to look at two specific metrics in esxtop or Advanced Charts.
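%RDY (CPU Ready)
This measures the time a VM is ready to run but has to wait because the hypervisor has no physical core available.
- Threshold: > 5% per vCPU (about 1,000 ms of Ready time per 20-second sample in vCenter’s real-time chart; 2,000 ms there is already 10%).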
- Meaning: The host is overcommitted, or a “limit” is set on the VM. Check for artificial limits immediately using PowerCLI.
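%CSTP (Co-Stop)
This is the “Wide VM” penalty. If you assign 8 vCPUs to a VM, the scheduler has to keep all 8 progressing at roughly the same pace. When some vCPUs fall behind because no physical core is free for them, ESXi stops the ones that are ahead until the rest catch up. That stopped time is Co-Stop.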
- Threshold: > 3%.
- Meaning: Your VM is too big. Reducing vCPUs from 8 to 4 often increases performance because the scheduler can run the VM more frequently. This is called “Right-Sizing”.
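vCenter charts report CPU Ready as a summation in milliseconds, while esxtop shows a percentage, so the two are easy to mix up. The sketch below shows the conversion arithmetic, assuming the standard chart sample lengths (20 s real-time, 300 s past day, and so on) and assuming you divide a VM-level aggregate by its vCPU count to get a per-vCPU average. The function and constant names are illustrative, not part of any VMware tool.

```python
# Sample lengths (in seconds) of the vCenter performance chart intervals.
CHART_INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def ready_percent(summation_ms, interval="realtime", vcpus=1):
    """Convert a CPU Ready summation (ms) into an average %RDY per vCPU."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    interval_ms = CHART_INTERVAL_SECONDS[interval] * 1000
    # Ready time as a share of the sample window, averaged over the vCPUs.
    return summation_ms * 100 / (interval_ms * vcpus)


if __name__ == "__main__":
    print(ready_percent(1000))           # 5.0  (1,000 ms in a 20 s sample)
    print(ready_percent(2000))           # 10.0 (2,000 ms in a 20 s sample)
    print(ready_percent(4000, vcpus=4))  # 5.0  (aggregate for a 4-vCPU VM)
```

The same 1,000 ms reading means very different things in different charts: in the past-day view (300 s samples) it is only about 0.33%, which is why the interval has to be part of the calculation.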
3. Storage: The Latency Formula
Storage troubleshooting often devolves into finger-pointing between the Server Admin and the Storage Admin. Use this formula to determine exactly where the problem lies:
GAVG (Total Guest Latency) = KAVG (Kernel Latency) + DAVG (Device Latency)
- High KAVG (> 2 ms): The problem is inside the host.
  - Cause: The ESXi storage queues are full. Too many high-IOPS VMs are sharing the same LUN/Datastore.
  - Fix: Increase the HBA Queue Depth or migrate VMs to different datastores to spread the load.
- High DAVG (> 20 ms): The problem is the storage array or fabric.
  - Cause: The array is slow, a drive has failed, or the Fibre Channel/iSCSI network is congested.
  - Fix: Send a screenshot of the DAVG metric to your storage team. It is strong evidence that the bottleneck sits below the host, in the array or the fabric.
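The formula and its two thresholds can be written down as a small triage function. This is a minimal sketch, assuming you have already read KAVG and DAVG (in milliseconds) for one adapter or device from esxtop. The function name and the wording of the findings are made up for illustration.

```python
KAVG_LIMIT_MS = 2.0   # kernel latency threshold quoted in the text
DAVG_LIMIT_MS = 20.0  # device latency threshold quoted in the text


def classify_latency(kavg_ms, davg_ms):
    """Return (gavg_ms, findings) for one esxtop disk latency sample."""
    gavg_ms = kavg_ms + davg_ms  # GAVG = KAVG + DAVG
    findings = []
    if kavg_ms > KAVG_LIMIT_MS:
        findings.append("host: ESXi storage queues are backing up")
    if davg_ms > DAVG_LIMIT_MS:
        findings.append("array/fabric: device latency is high")
    if not findings:
        findings.append("ok: neither component is over its threshold")
    return gavg_ms, findings


if __name__ == "__main__":
    print(classify_latency(0.5, 35.0))  # slow array or fabric
    print(classify_latency(4.0, 6.0))   # queuing inside the host
```

Both findings can fire at once, which is worth knowing before the finger-pointing starts: a slow array also backs up the host queues.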
The Snapshot Trap
Snapshots are performance poison. Each snapshot forces the hypervisor to traverse a delta disk chain for every read operation.
- Rule: Never run production VMs on snapshots for more than 24-48 hours. If you see high latency on a specific VM, check for “forgotten” snapshots first.
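The 24-48 hour rule is easy to automate once you have a list of snapshots with their creation times. The sketch below assumes a hypothetical input format, a list of `(vm_name, snapshot_name, created)` tuples such as you might build from an inventory export. It does not talk to vCenter itself.

```python
from datetime import datetime, timedelta


def stale_snapshots(snapshots, now, max_age_hours=48):
    """Return (vm, snapshot, age_hours) for snapshots older than the limit.

    `snapshots` is an iterable of (vm_name, snapshot_name, created) tuples;
    this input shape is an assumption made for the example.
    """
    limit = timedelta(hours=max_age_hours)
    stale = []
    for vm_name, snap_name, created in snapshots:
        age = now - created
        if age > limit:
            age_hours = round(age.total_seconds() / 3600, 1)
            stale.append((vm_name, snap_name, age_hours))
    # Oldest first, so the most overdue snapshots are at the top.
    return sorted(stale, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    inventory = [
        ("db01", "pre-patch", datetime(2024, 1, 1, 12, 0)),
        ("web01", "today", datetime(2024, 1, 10, 8, 0)),
        ("app01", "upgrade", datetime(2024, 1, 7, 12, 0)),
    ]
    for row in stale_snapshots(inventory, now=datetime(2024, 1, 10, 12, 0)):
        print(row)
```

Passing `now` explicitly keeps the function deterministic. In a scheduled job you would pass `datetime.now()`.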
4. Memory: Ballooning vs. Swapping
Do not panic if you see high memory usage in vCenter; guest operating systems will happily use all the RAM they are given for caching, so high consumption alone proves little. Panic only when reclamation starts.
- Ballooning (MCTLSZ > 0): The host is under pressure and is inflating the balloon driver to reclaim RAM from guests. This causes sluggishness but is recoverable.
- Swapping (SWCUR > 0): The host has exhausted physical RAM and is paging memory to disk. This is a hard performance stop.
- Fix: If a host is swapping, you must migrate VMs away (vMotion) immediately or shut down non-critical workloads. There is no tuning fix for swapping other than adding hardware or reducing load.
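The two reclamation signals translate directly into a severity check. The sketch below assumes you have collected per-VM MCTLSZ and SWCUR values (in MB) from esxtop’s memory view into simple tuples. The function name and return shape are illustrative.

```python
def triage_memory(samples):
    """Classify host memory pressure from per-VM esxtop readings.

    `samples` is an iterable of (vm_name, mctlsz_mb, swcur_mb) tuples
    (an assumed input shape for this example).
    """
    samples = list(samples)
    swapping = [vm for vm, _, swcur in samples if swcur > 0]
    ballooning = [
        vm for vm, mctlsz, swcur in samples if mctlsz > 0 and swcur <= 0
    ]
    if swapping:
        state = "swapping"    # act now: migrate VMs away or shed load
    elif ballooning:
        state = "ballooning"  # under pressure, but recoverable
    else:
        state = "ok"
    return {"state": state, "swapping": swapping, "ballooning": ballooning}


if __name__ == "__main__":
    readings = [("db01", 0, 0), ("web01", 512, 0), ("app01", 1024, 256)]
    print(triage_memory(readings))
```

Swapping takes priority over ballooning in the result because it is the condition that calls for immediate action.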
5. Network: The CPU Correlation
Network issues are rarely just about bandwidth. If you see dropped packets (%DRPTX/%DRPRX in esxtop), check the CPU first.
- The CPU-Net Link: If a host CPU is pegged at 100%, it cannot process network interrupts fast enough, leading to packet loss.
- Dropped Packets: If CPU is low but drops are high, check your ring buffers or physical cabling. But 9 times out of 10, “network lag” on a saturated host is actually CPU saturation.
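The “check the CPU first” rule can be expressed as a two-step decision. This is a rough sketch, not a diagnostic tool: the 90% cut-off for “saturated” is an assumption chosen for illustration (the text only says “pegged”), and the function name is hypothetical.

```python
CPU_SATURATED_PCT = 90.0  # assumed cut-off for a "pegged" host; tune to taste


def diagnose_drops(drop_pct, host_cpu_pct, cpu_limit=CPU_SATURATED_PCT):
    """Suggest where to look first when esxtop shows dropped packets."""
    if drop_pct <= 0:
        return "no drops: look elsewhere"
    if host_cpu_pct >= cpu_limit:
        return "cpu: host is saturated, fix CPU contention first"
    return "network: check ring buffers and physical cabling"


if __name__ == "__main__":
    print(diagnose_drops(drop_pct=1.5, host_cpu_pct=98.0))
    print(diagnose_drops(drop_pct=1.5, host_cpu_pct=35.0))
```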
6. Advanced Storage Diagnostics: Beyond Latency
If latencies (GAVG/KAVG) look fine but performance is still crawling, the issue isn’t congestion—it’s the I/O Pattern.
The vscsiStats Tool
Most admins stop at esxtop. But if you need to know why a disk is slow, you need vscsiStats. It reveals if your “sequential” database backup is actually generating millions of random 4k writes.
- Use Case: When storage vendors claim “the array is fine” but the VM is slow. Use this to prove the I/O profile matches (or mismatches) the storage class (e.g., sending random I/O to spinning disks).
SIOC (Storage I/O Control) & Artificial Limits
SIOC is designed to be a referee during contention, but it is often the cause of the problem.
- The Trap: A “Limit” set on a VM’s virtual disk (IOPS limit) is absolute. It applies even when the array is empty.
- Check: Verify no accidental IOPS limits were applied during a previous troubleshooting session and forgotten.
7. Network Traffic Shaping (NIOC)
If you use a Distributed Switch (VDS), Network I/O Control (NIOC) is likely active.
The “Shares” vs. “Limits” Trap
- Shares only kick in when the link is saturated. They are safe.
- Limits are hard caps. If you set a limit on “Virtual Machine Traffic” to ensure Management traffic gets through, you might be artificially throttling your production VMs 24/7.
- The Fix: Audit your VDS settings. Unless you have a very specific QoS requirement, remove all hard Limits on system traffic classes.
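The difference between the two settings is easiest to see with numbers. The sketch below is a simplified model, assuming shares divide a saturated uplink proportionally among the traffic types that are active, and that a limit caps throughput whether or not the link is busy. The share values and traffic names are illustrative, not a statement of your VDS configuration.

```python
def bandwidth_under_contention(link_gbps, shares):
    """Split a saturated uplink proportionally to each traffic type's shares."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one traffic type needs a positive share")
    return {name: link_gbps * value / total for name, value in shares.items()}


def apply_limit(demand_gbps, limit_gbps=None):
    """A limit caps throughput even when the link is otherwise idle."""
    if limit_gbps is None:
        return demand_gbps
    return min(demand_gbps, limit_gbps)


if __name__ == "__main__":
    shares = {"vm": 100, "vmotion": 50, "management": 50}
    # Only matters when the 10 Gbit uplink is saturated.
    print(bandwidth_under_contention(10, shares))
    # VMs want 6 Gbit/s on an idle link, but a 2 Gbit/s limit still applies.
    print(apply_limit(6, limit_gbps=2))
```

With these example shares, VM traffic is guaranteed half of a congested 10 Gbit link (5 Gbit/s) and can use all of it when nothing else is active. The 2 Gbit/s limit throttles it at all times.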
8. Baselines: “Slow” is Relative
You cannot troubleshoot “slowness” if you don’t know what “normal” looks like.
The “Is this normal?” Test
- Scenario: A Host is at 90% CPU. Is that a problem?
- Context: If it usually runs at 50%, yes. If it’s a number-crunching cluster that always runs at 90%, then no—performance might be fine.
- Tool: Use Aria Operations or vCenter’s historical charts. Don’t just look at “Realtime.” Look at the last 30 days. If the spike correlates with a recent software update or config change, you have your smoking gun.
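One simple way to make “Is this normal?” concrete is to compare the current reading with the mean and spread of the historical samples. This is a sketch of one possible approach, not how Aria Operations computes its baselines: the three-sigma cut-off and the function name are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomaly(history, current, sigmas=3.0):
    """Flag `current` if it sits more than `sigmas` deviations from baseline.

    `history` needs at least two samples (for example daily CPU averages
    over the last 30 days).
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return current != baseline
    return abs(current - baseline) > sigmas * spread


if __name__ == "__main__":
    usually_50 = [48, 50, 52, 49, 51, 50, 47, 53]
    usually_90 = [88, 91, 90, 89, 92, 90, 87, 93]
    print(is_anomaly(usually_50, 90))  # True: 90% is far outside the norm
    print(is_anomaly(usually_90, 90))  # False: 90% is business as usual
```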
9. The “Update” Reality Check
Driver/Firmware Mismatch
The most common “unexplained” purple screen of death (PSOD) or packet loss issue is a driver/firmware mismatch.
- The Rule: Do not just update ESXi. You must check the Hardware Compatibility List (HCL). Running a new ESXi driver on old NIC firmware (or vice versa) is a recipe for dropped packets.
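- vSphere 8 Specifics: If you are on vSphere 8, note that DRS (redesigned in vSphere 7) evaluates the cluster every minute instead of every five minutes and works from per-VM scores, so it can migrate more readily than older versions did. If VMs are vMotioning too frequently (creating overhead), you may need to tune the DRS migration threshold.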
10. The esxtop Cheatsheet
When vCenter charts are too slow or averaging out the spikes, esxtop is the truth. It runs directly on the host shell and updates in real-time.
The Navigation Map
Don’t memorize the manual. You only need these keystrokes to navigate the views:
- c – CPU View (Look for %RDY, %CSTP)
- m – Memory View (Look for MCTLSZ, SWCUR)
- n – Network View (Look for %DRPTX, %DRPRX)
- d – Disk Adapter View (Look for DAVG/cmd, KAVG/cmd)
- u – Disk Device View (Granular LUN view)
- v – Disk VM View (per-VM virtual disk latency)
- V – (uppercase) Filters the current view to just VMs
The “Ghost” Fields
By default, esxtop hides some columns. Press f to add/remove fields.
- In CPU View (c): Ensure you see %RDY and %CSTP.
- In Memory View (m): Ensure you see the NUMA stats (N%L, the share of a VM’s memory that is NUMA-local) if running large VMs.
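For spikes that are too short to catch by eye, esxtop can also run in batch mode (`esxtop -b -d 5 -n 12 > capture.csv`) and the capture can be scanned offline. The sketch below finds the peak value of every column whose header ends in `\% Ready`. It assumes the perfmon-style headers that batch mode normally writes (for example `\\host\Group Cpu(id:name)\% Ready`), so check the header row of your own capture before relying on it. The function name and sample data are made up.

```python
import csv
import io


def peak_values(csv_text, suffix="\\% Ready"):
    """Return {column header: highest value} for columns ending in `suffix`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name.endswith(suffix)]
    peaks = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip short rows and non-numeric cells
            name = header[i]
            peaks[name] = max(peaks.get(name, value), value)
    return peaks


if __name__ == "__main__":
    # Synthetic two-sample capture in the assumed header format.
    sample = (
        '"(PDH-CSV 4.0) (UTC)(0)",'
        '"\\\\esx01\\Group Cpu(1001:db01)\\% Ready",'
        '"\\\\esx01\\Group Cpu(1001:db01)\\% Used"\n'
        '"01/10/2024 12:00:00","3.10","41.00"\n'
        '"01/10/2024 12:00:05","7.25","55.00"\n'
    )
    for column, peak in peak_values(sample).items():
        print(column, peak)
```

Passing a different `suffix` lets the same scan pick out other counters, provided you copy the exact counter name from your capture’s header row.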
11. Context: The “What Changed?” Rule
Metrics tell you what is breaking, but context tells you why. Before rebuilding your storage array, ask the simple questions that usually solve 90% of “sudden” performance drops:
- The Backup Window: Is a backup job stuck or overrunning into production hours?
- The Security Agent: Did the security team push a new EDR/Antivirus agent that is scanning every disk read?
- The “Ghost” Change: Did someone migrate a test VM to this production host and leave it there?
Good record keeping is about knowing that “Host A” usually runs at 50% CPU, so seeing it at 90% today is actually an anomaly.
Conclusion
Performance troubleshooting is not an art, but rather a process of elimination.
You start with the physical layer (Host CPU, RAM, Network). If those are clear, you move to the configuration layer (Limits, Shares, Drivers). Finally, you check the VM layer (Snapshots, Tools, OS health).
Ignore the “best practices” that don’t apply to your specific workload. Focus on the hard metrics – %RDY, swapping, and latency. If those numbers are clean, your infrastructure is likely fine, and the problem is inside the application code.
Don’t guess. Measure.