Thursday, 27 September 2012

Having Intel NICs serving ESX 5.x hosts? Watch out for interface errors causing service disruption!

Some weeks ago I noticed that some of my virtual machines, residing on Dell PowerEdge R810s/R910s running ESXi 5.0 Update 1, did not behave well network-wise. Symptoms were random ICMP ping drops, occasional TCP connection drops, and vMotions that would sometimes fail. The service disruptions would get worse the longer an ESXi host had been up.

Investigating further, I found several errors on some of the Intel NICs. Here's some sample ethtool output:

~ # ethtool -S vmnic7 | grep err
[...]
     rx_fifo_errors: 2305
[...]
     rx_queue_0_csum_err: 9
~ # ethtool -S vmnic6 | grep err
[...]
     rx_fifo_errors: 9848
[...]
     rx_queue_0_csum_err: 0

Counters would sometimes rise dramatically, by several tens of thousands, within just a few minutes of heavy network load (e.g. multiple vMotions over multiple 1 GbE interfaces).
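To get a feeling for how fast the counters grow under load, a simple polling loop in the ESXi shell is enough. This is just a sketch (vmnic6 and the 10-second interval are arbitrary examples), and it does nothing but read the counters via ethtool:

~ # while true; do date; ethtool -S vmnic6 | grep -E 'rx_fifo_errors|rx_queue_0_csum_err'; sleep 10; done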

Broadcom interfaces, and hosts equipped solely with Broadcom interfaces (i.e. not using the igb/e1000 drivers), did not show any issues. Also, the rx_fifo_errors and rx_queue_0_csum_err counters would move from one interface to another after a reboot, making it impossible to isolate potentially bad interfaces/adapters!

On the Cisco switch side, there were no further indications besides some forgivable out-discards. Upgrading to the latest IOS release did not help, nor did a promising VMware e1000 fix (see http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2020668).

Working together with Dell ProSupport gave us the opportunity to test Broadcom NICs in the servers. We replaced four Intel dual-port NICs with four Broadcom dual-port NICs, aaaand... gotcha! Errors gone, production restored.

With the release of VMware ESXi 5.1, the situation is still the same. I had the opportunity to test this in my lab and at a customer's site (running IBM servers, ESX 5 and a bunch of Intel 82580/82571-based cards). The problems really do seem to be related to either the Intel hardware or its driver software.

For the time being, my advice would be to monitor your ESXi hypervisors' NICs more closely when running on Intel (using ethtool, e.g. as sketched below), or to opt for Broadcoms.
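For a quick sweep over all uplinks, something along the following lines works in the ESXi shell. It is only a sketch and assumes the standard esxcli/ethtool tooling of ESXi 5.x; the awk expression merely skips esxcli's two header lines, and the final grep hides counters that are still zero:

~ # for nic in $(esxcli network nic list | awk 'NR>2 {print $1}'); do echo "=== $nic ==="; ethtool -S $nic 2>/dev/null | grep err | grep -v ': 0$'; done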

Also, if anyone out there has fought this through with VMware, a hardware vendor or Intel, I'd be happy to hear about the outcome.

Wednesday, 21 March 2012

Dell MEM 1.1 and ESXi 5: don't forget the Storage Heartbeat!

I just ran into a (lab) situation where ESXi 5 hosts with four NICs each, connected to Dell EqualLogic arrays via two stacked switches, completely lost connectivity to the storage arrays when maintenance on the network was due. The hosts' NICs were spread evenly over the two switches, so one would not have expected an unplanned outage when a single switch is rebooted.

So, who was the bad guy? The admin.

Turns out I had forgotten to add the Dell-recommended "Storage Heartbeat" VMkernel port. After re-running the setup script with --heartbeat <IPinISCSIsubnet>, everything is fine: the EqualLogic group IP can now be reached reliably via the newly configured VMkernel port, which, unlike the iSCSI ports, must not have a 1:1 vmk/vmnic binding.
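For reference, the relevant calls might look like the following. Treat this as a sketch: apart from --heartbeat (which comes straight from the Dell documentation), the setup.pl parameters, the vmk number and the group IP are placeholders to adapt to your environment.

On the machine where the MEM setup script lives:

./setup.pl --configure --server=<ESXi host> --heartbeat=<IPinISCSIsubnet>

Then, on the ESXi host, verify that the new VMkernel port can reach the EqualLogic group IP (vmk5 is just an example; use whatever interface the script created):

~ # esxcli network ip interface ipv4 get
~ # vmkping -I vmk5 <EqualLogic group IP>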

For those of you about to configure iSCSI on vSphere together with EqualLogic arrays, this VMware KB article is worth reading; follow the links there for the Dell configuration guides.