Monday, March 9, 2015

How Can Virtual Performance Beat Native Performance?


Since the inception of virtualization, it has been accepted that some amount of overhead gets added to any workload running in a virtual machine. Over time, VMware's focus has expanded from consolidating workloads to handling mission-critical or tier-1 apps, and on to high-performance apps. When developing high-performance apps in any context, it is key for the workload to leverage native hardware acceleration wherever possible. Scroll down for my "Virtualizing HPC and Latency-Sensitive Apps on vSphere Primer".

It is important to size the workload appropriately for the physical platform: aligning virtual NUMA nodes with physical NUMA nodes for processes and memory, matching threads to cores, and offloading processing in the IO pipeline wherever possible. And that's just to start. It follows that for a virtual machine's performance to approach or beat that of the same workload on bare metal, the virtualization platform must expose as many of those features as it can. Perhaps most importantly, the hypervisor should get out of the way whenever possible to let the VM's processes run without interruption.
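As a rough illustration of the NUMA sizing point, here is a minimal Python sketch that checks whether a requested VM shape fits inside one physical NUMA node. The `NumaNode` and `VmSize` types and the example host figures are hypothetical, and the check ignores hypervisor memory overhead and hyperthreading, so treat it as a first-pass sanity check rather than a sizing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    """One physical NUMA node (hypothetical host description)."""
    cores: int
    memory_gb: int


@dataclass(frozen=True)
class VmSize:
    """Requested VM shape (hypothetical)."""
    vcpus: int
    memory_gb: int


def fits_in_one_node(vm: VmSize, node: NumaNode) -> bool:
    """True if the VM's vCPUs and memory both fit inside a single NUMA node."""
    return vm.vcpus <= node.cores and vm.memory_gb <= node.memory_gb


def nodes_spanned(vm: VmSize, node: NumaNode) -> int:
    """Minimum number of identical NUMA nodes the VM would have to span."""
    by_cpu = -(-vm.vcpus // node.cores)           # ceiling division
    by_mem = -(-vm.memory_gb // node.memory_gb)   # ceiling division
    return max(by_cpu, by_mem)


if __name__ == "__main__":
    # Assumed example host: 12 cores and 128 GB per NUMA node.
    node = NumaNode(cores=12, memory_gb=128)
    for vm in (VmSize(8, 96), VmSize(16, 64), VmSize(8, 192)):
        print(vm, "fits:", fits_in_one_node(vm, node),
              "nodes spanned:", nodes_spanned(vm, node))
```

A VM that spans more than one node can still run, but some of its memory accesses will be remote, which is exactly the kind of overhead the alignment advice above is trying to avoid.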

However, how much you want the hypervisor to get out of the way depends on whether you are optimizing for a latency-sensitive workload or a high-throughput workload. For high throughput, you may want to let the VMs run as much as possible without interrupts, though this may add latency in the IO path. For a latency-sensitive workload, you may want to disable interrupt coalescing, but then you are deliberately servicing IO at the expense of compute. Because you are trading off throughput and parallelization against latency, the settings and recommendations below should be evaluated and tested thoroughly to understand whether they fit the workload. If your workload demands both high throughput and low latency, you may have no choice but to adopt VMDirectPath or SR-IOV, which have their own set of tradeoffs listed in the docs here: http://pubs.vmware.com/vsphere-55/index.jsp?topic=%2Fcom.vmware.vsphere.networking.doc%2FGUID-BF2770C3-39ED-4BC5-A8EF-77D55EFE924C.html
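To make the coalescing tradeoff concrete, here is a toy Python model. It is not how ESXi or any particular NIC implements coalescing; the timer-based scheme and the `coalesce` function are assumptions chosen only to show how a longer coalescing window reduces the interrupt count while adding delivery latency.

```python
def coalesce(arrivals_us, window_us):
    """Return (interrupt_count, mean_added_latency_us) for a toy model.

    Assumed model: the first packet to arrive while nothing is pending
    starts a timer. The interrupt fires window_us later and delivers every
    packet that has arrived by then. window_us = 0 means an interrupt per
    packet (for distinct arrival times). arrivals_us must be sorted.
    """
    interrupts = 0
    total_delay = 0.0
    i = 0
    n = len(arrivals_us)
    while i < n:
        fire_at = arrivals_us[i] + window_us
        interrupts += 1
        # Deliver everything that arrived before the timer fired.
        while i < n and arrivals_us[i] <= fire_at:
            total_delay += fire_at - arrivals_us[i]
            i += 1
    return interrupts, (total_delay / n if n else 0.0)


if __name__ == "__main__":
    # 100 packets, one every 10 microseconds (made-up traffic pattern).
    arrivals = list(range(0, 1000, 10))
    for window in (0, 50, 200):
        count, latency = coalesce(arrivals, window)
        print(f"window={window}us interrupts={count} "
              f"mean added latency={latency:.1f}us")
```

Fewer interrupts leave more uninterrupted CPU time for the VM's compute, which is the throughput side of the trade. The added per-packet delay is the latency side.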

Over the course of VMware's hypervisor development, plenty of milestones have contributed to its performance-honed characteristics and features. A good, though not definitive, list:
·      VMware's first product was Workstation, but ESX 1.0 was its first Type 1 hypervisor
·      ESX 3 introduced a service console VM where previously ESX had to statically assign IO devices
·      ESX 4.0 introduced VMDirectPath
·      ESXi 4.1 eliminated the service console VM
·      ESXi 5 rewrote the hypervisor to become the best platform to run Cloud Foundry, which brought an entirely new set of requirements around very fast provisioning and power-on of large numbers of virtual machines. Arguably this is where ESXi really learned how to get out of the way of a workload, delivering near-native performance in most cases.
·      ESXi 5.1 introduced some of the latency-sensitive tuning primitives, but setting them required advanced vmkernel options
·      ESXi 5.5 built on that latency-sensitive tuning and the granularity of the hypervisor, adding a simple checkbox to mark a VM as latency-sensitive

In addition, each major and minor version of ESX(i) has added support for the latest and greatest chipsets from Intel and AMD, as well as new NICs and storage adapters. These advancements were accompanied by updates to the virtual hardware of a VM and to VMware Tools, the in-guest set of drivers recommended for best performance and manageability.

By allowing the best translation of the native functionality and offloading of the underlying hardware, ESXi gets VMs to near-native performance for most throughput-driven workloads. However, there are cases where benchmarks show that virtualized workloads can exceed the performance of their native equivalents. So how is this possible? To put it another way, can the hypervisor be a better translation, management and scheduling engine for the hardware than an OS kernel itself? Why not keep the workload physical?

A workload running on bare metal will of course have direct access to all the hardware on that server. However, the sizing you then have to accept is the total CPU and memory of that server, and that sizing is static. With distributed systems, cloud-native apps or platform 3 apps, it is rarely about a single server. It is more about the aggregate performance across tens, hundreds or, for some, even thousands of servers. In a discrete and multi-tenant (or "microservices" if you want) architecture, the requirement for dynamic and flexible sizing in aggregate is a natural fit for virtualization.

Virtualizing HPC and Latency-Sensitive Apps on vSphere Primer

Understanding the workload is critical, of course, in order to size the VM optimally. With traditional IT workloads, the more common problem was oversized VMs. For high-performance apps, however, esxtop can help determine whether the VM is constrained by CPU, memory, storage or network. My esxtop checklist is this KB article from VMware: http://kb.vmware.com/kb/2001003.
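As a sketch of how esxtop readings might be turned into a "which resource is the constraint?" answer, the Python below maps a handful of esxtop counters to the resource they describe. The counter names are esxtop column headers, but every threshold value here is a placeholder assumption for illustration, not VMware guidance and not taken from the KB article above. The `constrained_resources` helper is hypothetical.

```python
# Placeholder thresholds (assumptions for illustration only); tune per workload.
THRESHOLDS = {
    "%RDY": 10.0,      # CPU ready time, percent
    "%CSTP": 3.0,      # CPU co-stop, percent
    "SWR/s": 0.0,      # memory swapped in from disk, per second
    "DAVG/cmd": 25.0,  # device latency per storage command, ms
    "KAVG/cmd": 2.0,   # vmkernel latency per storage command, ms
    "%DRPRX": 0.0,     # received packets dropped, percent
    "%DRPTX": 0.0,     # transmitted packets dropped, percent
}

# Which resource each counter speaks to.
RESOURCE = {
    "%RDY": "cpu",
    "%CSTP": "cpu",
    "SWR/s": "memory",
    "DAVG/cmd": "storage",
    "KAVG/cmd": "storage",
    "%DRPRX": "network",
    "%DRPTX": "network",
}


def constrained_resources(sample, thresholds=THRESHOLDS):
    """Return a sorted list of resources whose counters exceed their threshold.

    `sample` maps counter name to the observed value; counters missing
    from the sample are treated as zero.
    """
    hits = {
        RESOURCE[name]
        for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    }
    return sorted(hits)


if __name__ == "__main__":
    # Made-up readings for one VM.
    sample = {"%RDY": 14.2, "DAVG/cmd": 4.0, "%DRPRX": 0.3}
    print(constrained_resources(sample))
```

In practice you would watch these counters over a representative run of the workload rather than judge from a single sample.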

For a platform configuration checklist for high-performance workloads, see http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf and the points below:
·      Make sure the BIOS is updated and set for maximum performance. Your mileage may always vary due to the BIOS and firmware configuration of different components in the hardware. Even virtualized, these issues can still cause performance to lag.
·      C-states support should be disabled.
·      Power management in the BIOS should be disabled.
·      Use the latest stable ESXi version that you can. See the ESXi generation improvements above. One caveat is that the drivers for the hardware may differ between ESXi versions, which can cause poor or inconsistent performance results. Throughput testing when adding new drivers is definitely recommended.
·      Size VMs to fit within a NUMA node of the chipset. This will depend on the processor generation. For example, see here for Dell's recommendations for Haswell: http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/09/23/bios-tuning-for-hpc-on-13th-generation-haswell-server. Also, here's an older but still relevant article describing the NUMA affinity and migration techniques of the vmkernel: http://blogs.vmware.com/vsphere/2012/02/vspherenuma-loadbalancing.html
·      When sizing VMs, do not plan on overprovisioning any hardware component. Any single bottleneck will typically determine overall performance, leaving aggregate performance less than optimal.
·      Choose the latest stable generation OS that you can. Later OS versions typically have more optimized hardware interrupt handling mechanisms. For example, see Red Hat's performance tuning recommendations for RHEL 7 (support subscription required) or for RHEL 6.
·      Use the latest version of VMware Tools which includes the latest paravirtualized drivers such as PVSCSI and VMXNET3.
·      Use PVSCSI when you can, but be careful of high-throughput issues with default queue settings. For example, this KB article describes a case where the default queue depth for PVSCSI may be insufficient: http://kb.vmware.com/kb/2053145
·      Use VMXNET3 when you can, and pay attention to how much you can offload to the hardware with regard to LRO, RSS, multiqueue and other NIC-specific optimizations. Some relevant VMware KB articles and a tech paper: http://kb.vmware.com/kb/2020567, http://kb.vmware.com/kb/1027511 and http://www.vmware.com/files/pdf/VMware-vSphere-PNICs-perf.pdf
·      Low throughput on UDP in a Windows VM is another case to consider if your application IO depends on it. You may need to modify the vNIC settings: http://kb.vmware.com/kb/2040065
·      Overprovisioning the network capacity can be significantly trickier than sizing for CPU and memory, especially if using NFS for storage network traffic. It's key to understand whether the storage will be accessed by the vmkernel from SAN or NAS sources or by the VM itself. Use Network IO Control to enable fairer sharing of network bandwidth. However, understand that this may cause more interrupts, resulting in more overhead from switching between VMs, so if these are high-throughput (compute and memory) VMs, consider placing them on separate hosts: http://www.vmware.com/files/pdf/techpaper/Network-IOC-vSphere6-Performance-Evaluation.pdf
·      If you have a particularly latency-sensitive workload, consider using SR-IOV or VMDirectPath. See the latest benchmarks here for Infiniband and RDMA: http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf
·      For Infiniband workloads, plan on using SR-IOV or VMDirectPath. The latest benchmarks are here: http://blogs.vmware.com/cto/running-hpc-applications-vsphere-using-infiniband/ http://blogs.vmware.com/cto/hpc-update/
·      For Nvidia GPGPU (general purpose or non-VDI) workloads, plan on using VMDirectPath. With vSphere 6, Nvidia vGRID (think of SR-IOV for GPUs instead of NICs) will be supported by VMware Horizon View, so hopefully vGRID support for GPGPU workloads will be available soon. More details from Nvidia here: http://blogs.nvidia.com/blog/2015/02/03/vmware-nvidia-gpu-sharing/
·      For Xeon Phi, there is no support today on vSphere 5.5. This feature, the MIC or specialized "Many Integrated Core", is ignored by the hypervisor.
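The PVSCSI queue-depth caution in the checklist above can be reasoned about with Little's law: the average number of IOs in flight equals the IO rate multiplied by the average latency. The Python sketch below applies that arithmetic. The function names are hypothetical, and the queue depth of 64 in the example is an assumed value for illustration, so check the KB article and your own configuration for the real limits.

```python
import math


def outstanding_ios(iops: float, latency_ms: float) -> int:
    """Little's law: IOs in flight = arrival rate x time in system."""
    return math.ceil(iops * latency_ms / 1000.0)


def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS a queue of this depth can sustain at this latency."""
    return queue_depth * 1000.0 / latency_ms


def queue_is_sufficient(queue_depth: int, iops: float, latency_ms: float) -> bool:
    """True if the queue can hold the IOs the workload keeps in flight."""
    return outstanding_ios(iops, latency_ms) <= queue_depth


if __name__ == "__main__":
    depth = 64          # assumed per-device queue depth, for illustration
    target_iops = 50_000
    latency_ms = 2.0    # assumed average device latency
    print("IOs in flight needed:", outstanding_ios(target_iops, latency_ms))
    print("IOPS ceiling at this depth:", max_iops(depth, latency_ms))
    print("Queue sufficient:", queue_is_sufficient(depth, target_iops, latency_ms))
```

If the workload needs more IOs in flight than the queue allows, the excess waits above the adapter and shows up as added latency, which is why a high-throughput VM can be throttled by a default queue setting even when the storage itself has headroom.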

You'll also need to review in-guest OS settings and performance tuning variables which I won't detail in this post. And finally, you'll need to consider the application-specific tuning and optimizations. Given all of that, it should be possible to achieve better than native performance for certain high-performance workloads. For specific examples, see the excellent write-ups by VMware's performance team.

Hadoop on vSphere 6.0:
Redis on vSphere 6.0:

Links:
http://kb.vmware.com/kb/2020567 RSS MultiQueue Linux