A mismatch between ESXi load balancing policies and physical switch configuration is one of the most common causes of host isolation. The classic scenario: a junior admin changes the policy to “Route based on IP Hash” without first configuring a static EtherChannel on the physical switch.
Result: The host drops off the network immediately. vCenter access is lost. The only way back is the local console (DCUI) or iDRAC.
This guide covers the emergency recovery runbook via esxcli and why you should probably avoid IP Hash in the first place.
The Emergency Fix (ESXCLI Runbook)
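If you locked yourself out, stop guessing. Log into the physical console (or iDRAC/iLO), press Alt + F1 to access the shell, and log in as root. If no shell prompt appears, press F2 in the DCUI and enable ESXi Shell under Troubleshooting Options first.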
1. Inspect the Damage
Check the vSwitch policy first. In most setups, the management network is on vSwitch0.
esxcli network vswitch standard policy failover get -v vSwitch0
- Look for: The Load Balancing field. The default Port ID policy is reported here as srcport.
- Diagnosis: If it says iphash but the physical switch ports are not bundled into a static EtherChannel (port channel), that is your problem.
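When the console is all you have, it can help to reduce the output to the one field that matters. The following is a minimal sketch, assuming the `get` output contains a line of the form `Load Balancing: <value>`; the helper name `lb_policy` is hypothetical.

```shell
# Hypothetical helper: print only the load balancing value from
# `esxcli ... policy failover get` output read on stdin.
# Assumes a line shaped like "   Load Balancing: iphash".
lb_policy() {
    awk -F': *' '/^ *Load Balancing:/ { print $2; exit }'
}

# Usage on the host (not run here):
#   esxcli network vswitch standard policy failover get -v vSwitch0 | lb_policy
```

Anchoring the match to the start of the line keeps the helper from picking up other fields that merely contain the words "Load Balancing".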
2. Check for Overrides
Even if the vSwitch is correct, the “Management Network” port group might have a specific override.
esxcli network vswitch standard portgroup policy failover get -p "Management Network"
3. The Fix: Revert to “Port ID”
To restore connectivity, force the policy back to the default “Route based on originating port ID”.
To fix the vSwitch:
esxcli network vswitch standard policy failover set -v vSwitch0 -l portid
To fix the Port Group (if an override exists):
esxcli network vswitch standard portgroup policy failover set -p "Management Network" -l portid
4. Verify
Ping the gateway to confirm you are back online:
vmkping -I vmk0 <gateway_ip>
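If you would rather not retype the runbook under pressure, the vSwitch steps can be wrapped in one function. This is a sketch only: the function name `revert_if_iphash` is hypothetical, it assumes the management interface is vmk0 as above, and it assumes the `get` output contains a `Load Balancing: <value>` line. It does not touch port group overrides.

```shell
# Hypothetical wrapper: revert a vSwitch to Port ID only if it currently
# reports iphash, then ping the gateway three times from vmk0.
revert_if_iphash() {
    vswitch="$1"; gateway="$2"
    current=$(esxcli network vswitch standard policy failover get -v "$vswitch" \
        | awk -F': *' '/^ *Load Balancing:/ { print $2; exit }')
    if [ "$current" = "iphash" ]; then
        esxcli network vswitch standard policy failover set -v "$vswitch" -l portid
        echo "reverted $vswitch from iphash to portid"
    else
        echo "$vswitch reports '$current'; no change made"
    fi
    vmkping -I vmk0 -c 3 "$gateway"
}

# Usage on the host (not run here; 192.0.2.1 is a placeholder gateway):
#   revert_if_iphash vSwitch0 192.0.2.1
```

Checking the current value before writing means the function changes nothing when the policy was not the problem.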
The Policies: What Actually Matters
Route based on originating port ID (Default)
- How it works: Maps a virtual NIC (vNIC) to a physical uplink. Traffic from that VM is “pinned” to that uplink.
- The Verdict: This is the correct setting for 95% of standard vSwitch deployments. It requires zero physical switch configuration (no EtherChannel/LACP).
- Limit: A single vNIC cannot exceed the speed of one physical link (e.g., 10 Gbps). Placement is by port, not by traffic volume, so the aggregate load of 50 VMs spreads across all uplinks but is not guaranteed to be even.
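The pinning behavior can be pictured with a toy model. This is a simplification and an assumption, not VMware's exact algorithm: it treats the uplink as the virtual port ID modulo the number of uplinks. The function name and port numbers are hypothetical.

```shell
# Simplified model (an assumption, not VMware's exact algorithm):
# a virtual port is pinned to uplink index (port_id mod uplink_count).
uplink_for_port() {
    port_id="$1"; uplink_count="$2"
    echo $(( port_id % uplink_count ))
}

# Eight hypothetical virtual ports spread over two uplinks.
for port in 0 1 2 3 4 5 6 7; do
    echo "port $port -> uplink $(uplink_for_port "$port" 2)"
done
```

In this model each port always lands on the same uplink, which is why no switch-side configuration is needed: the physical switch only ever sees a given MAC address on one port.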
Route based on physical NIC load (LBT)
- Requirement: Distributed Switch (VDS) only.
- How it works: The VDS checks physical uplink load every 30 seconds. If an uplink exceeds 75% utilization, it moves virtual ports (and the VMs behind them) to a less busy adapter.
- The Verdict: This is the Gold Standard for enterprise clusters. It provides dynamic load balancing without the complexity of LACP.
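The decision LBT makes can be sketched in a few lines. This is a toy model under stated assumptions, not the VDS implementation: utilization figures are supplied by hand as whole percentages, and the function names and the `name:percent` input format are hypothetical.

```shell
# Simplified model of the LBT decision described above.
# Inputs are hypothetical: uplink utilization in whole percent.
LBT_THRESHOLD=75

# Succeeds (exit 0) when utilization is strictly above the threshold.
lbt_should_move() {
    utilization="$1"
    [ "$utilization" -gt "$LBT_THRESHOLD" ]
}

# Pick the least-loaded uplink from "name:percent" arguments.
least_loaded() {
    for pair in "$@"; do echo "$pair"; done \
        | sort -t: -k2,2n | head -n 1 | cut -d: -f1
}

if lbt_should_move 82; then
    echo "move a port to $(least_loaded vmnic0:82 vmnic1:30)"
fi
```

Because the rebalancing happens per virtual port inside the host, the physical switch still sees each MAC address on a single port at any given time.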
Route based on IP Hash
- Requirement: Static EtherChannel (LAG) on the physical switch.
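- The Trap: If you turn this on before the switch is ready, you disconnect. A Standard Switch does not support LACP, so pairing this policy with an LACP port channel instead of a static one also disconnects you.
- The Verdict: Avoid unless a single VM talking to many destinations needs more aggregate bandwidth than one link provides. A single source-destination IP pair still uses only one uplink.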
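To see why IP Hash only helps a VM with many peers, consider a toy version of the hash. This is a simplified model based on the commonly documented description (XOR of the last octets of the source and destination IPv4 addresses, modulo the uplink count); treat the exact formula as an assumption. The function name and addresses are hypothetical.

```shell
# Simplified model of IP Hash uplink selection (an assumption):
# XOR the last octets of source and destination, then mod by uplink count.
ip_hash_uplink() {
    src_octet="${1##*.}"
    dst_octet="${2##*.}"
    uplink_count="$3"
    echo $(( (src_octet ^ dst_octet) % uplink_count ))
}

# One VM (10.0.0.10) talking to three hypothetical destinations over 2 uplinks.
for dst in 10.0.0.20 10.0.0.21 10.0.0.22; do
    echo "10.0.0.10 -> $dst uses uplink $(ip_hash_uplink 10.0.0.10 "$dst" 2)"
done
```

The same address pair always hashes to the same uplink, so one flow never exceeds one link. It also means the VM's MAC address can appear on several uplinks at once, which is why the physical switch must treat those ports as a single static channel.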
Field Insights
1. LACP is a Trap (Mostly)
The overwhelming consensus in sysadmin communities is that LACP on ESXi is rarely worth the headache.
- Why? It adds a rigid dependency between the host and the switch. If you need to restore a host configuration or replace a switch, the LACP mismatch can leave you isolated.
- Better approach: Use Route based on physical NIC load (LBT). It achieves load balancing purely in software, keeping the physical layer dumb and reliable.
2. Management Network = Keep It Simple
Don’t mix complex load balancing with your Management Network.
- Best Practice: Use Explicit Failover (Active/Standby) for the Management VMkernel.
- Why: If your sophisticated LACP/LBT data network creates a loop or breaks, you need a simple, bulletproof “back door” to access the host.
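On a Standard Switch, an Active/Standby layout for the management port group can be set from the same esxcli namespace used in the runbook. The sketch below only prints the command so you can review it before running it; the helper name `mgmt_failover_cmd` is hypothetical and the vmnic names are placeholders for your own uplinks.

```shell
# Hypothetical dry-run helper: print the esxcli command that pins the
# Management Network to one active and one standby uplink.
# vmnic names are placeholders; list yours with `esxcli network nic list`.
mgmt_failover_cmd() {
    active="$1"; standby="$2"
    printf 'esxcli network vswitch standard portgroup policy failover set -p "Management Network" -l explicit -a %s -s %s\n' \
        "$active" "$standby"
}

mgmt_failover_cmd vmnic0 vmnic1
```

Printing instead of executing is deliberate here: a teaming change on the management port group is exactly the kind of edit worth reading twice.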
3. The “Beacon Probing” Myth
Do not enable Beacon Probing if you only have 2 uplinks.
- The Risk: With only 2 NICs, each uplink only knows that it stopped hearing the other's beacons, so when one fails the host cannot determine which one is bad. Traffic can still be sent to the dead link.
- Rule: Use “Link Status Only” for 2 uplinks. Only consider Beacon Probing if you have 3+ uplinks.
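The rule is simple enough to encode so it is applied consistently across hosts. A minimal sketch, with a hypothetical function name; it prints a value suitable for esxcli's `-f` (failure detection) option, and returning `beacon` means beacon probing is permissible, not mandatory.

```shell
# Encode the rule above: beacon probing only with three or more uplinks.
# Prints a value for esxcli's --failure-detection (-f) option.
failure_detection_for() {
    uplink_count="$1"
    if [ "$uplink_count" -ge 3 ]; then
        echo beacon
    else
        echo link
    fi
}

# Usage on the host (not run here):
#   esxcli network vswitch standard policy failover set -v vSwitch0 \
#       -f "$(failure_detection_for 2)"
```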
Verdict
Technical skills on ESXi are about risk management. The esxcli commands above are your parachute when a policy change goes wrong. For day-to-day design, resist the urge to over-engineer. Use Originating Port ID for Standard Switches and Load Based Teaming (LBT) for Distributed Switches. Leave LACP for the networking team’s core switches, not your hypervisors.