Sunday, April 27, 2008

KVM Forum 2008 Agenda posted

The near-final agenda for the KVM Forum 2008 has been posted! I'm pleased to see a well-rounded set of presentations, covering all aspects of kvm development.

If you're interested in kvm development and haven't registered already, make sure to do so now.


See you all in Napa!

Friday, April 25, 2008

I/O: Maintainability vs Performance

I/O performance is of great importance to a hypervisor. I/O is also a huge maintenance burden, due to the large number of hardware devices that need to be supported, numerous I/O protocols, high availability options, and management for it all.

VMware opted for the performance option by putting the I/O stack in the hypervisor. Unfortunately the VMware kernel is proprietary, which means VMware has to write and maintain the entire I/O stack. That means a slow development rate, and that your hardware may take a while to be supported.

Xen took the maintainability route by doing all I/O within a Linux guest, called "domain 0". By reusing Linux for I/O, the Xen maintainers don't have to write an entire I/O stack. Unfortunately, this eats away at performance: every interrupt has to go through the Xen scheduler so that Xen can switch to domain 0, and everything has to go through an additional layer of mapping.

Not that Xen solved the maintainability problem completely: the Xen domain 0 kernel is still stuck on the ancient Linux 2.6.18 release (whereas 2.6.25 is now available). These problems have led Fedora 9 to drop support for hosting Xen guests, leaving kvm as the sole hypervisor.

So how does kvm fare here? Like VMware, I/O is done within the hypervisor context, so full performance is retained. Like Xen, kvm reuses the entire Linux I/O stack, so kvm users enjoy the latest drivers and I/O stack improvements. Who said you can't have your cake and eat it?

Tuesday, April 15, 2008

Memory overcommit with kvm

kvm supports (or rather, will support; this is work in progress) several ways of running guests with more memory than you have on the host:

Swapping
This is the classical way to support overcommit; the host picks some memory pages from one of the guests and writes them out to disk, freeing the memory for use. Should a guest require memory that has been swapped, the host reads it back from the disk.

Ballooning
With ballooning, the guest and host cooperate on which page is evicted. It is the guest's responsibility to pick the page and swap it out if necessary.

Page sharing
The hypervisor looks for memory pages that have identical data; these pages are all merged into a single page, which is marked read only. If a guest writes to a shared page, it is unshared before granting the guest write access.

Live migration
The hypervisor moves one or more guests to a different host, freeing the memory used by these guests.


Why does kvm need four ways of overcommitting memory? Each method provides different reliability/performance tradeoffs.

Ballooning is fairly efficient since it relies on the guest to pick the memory to be evicted. Often the guest can simply shrink its caches to free memory, which has very little impact on it. The problem with ballooning is that it relies on guest cooperation, which reduces its reliability.
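
To make the cooperation a bit more concrete, here is a minimal host-side sketch, assuming the guest's balloon driver has already handed back a list of guest page frame numbers. The function and parameter names are made up for illustration and the real virtio balloon device is more involved, but the essential host-side operation is an madvise(MADV_DONTNEED) on the affected range, which lets the host kernel reclaim the backing memory:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* Illustrative host-side handling of pages released by the guest's
     * balloon driver.  guest_ram is the guest's memory as mapped into
     * the host process; pfns[] are guest page frame numbers the guest
     * has promised not to touch until it deflates the balloon. */
    int balloon_release_pages(void *guest_ram, const uint64_t *pfns,
                              size_t count)
    {
        size_t i;

        for (i = 0; i < count; i++) {
            void *addr = (char *)guest_ram + pfns[i] * PAGE_SIZE;

            /* Tell the host kernel the contents are disposable; the
             * backing memory can be reclaimed, and the page reads back
             * as zeroes if the guest ever touches it again. */
            if (madvise(addr, PAGE_SIZE, MADV_DONTNEED) < 0)
                return -1;
        }
        return 0;
    }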

Swapping does not depend on the guest at all, so it is completely reliable from the host's point of view. However, the host has less knowledge than the guest about the guest's memory, so swapping is less performant than ballooning.

Page sharing relies on guest behavior indirectly. As long as guests run similar applications, the host will achieve a high share ratio. But if a guest starts running new applications, the share ratio will decrease and free memory on the host will drop.
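
The scanning side of this is conceptually simple. The toy function below is purely illustrative and not the actual kvm code: it hashes each page, does a full compare on pages whose hashes match, and counts how many pages could be shared. The part only the hypervisor can do is remapping the duplicates onto a single read-only page and unsharing them on write faults:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* FNV-1a hash of one page; used only as a cheap first-pass filter. */
    static uint64_t page_hash(const unsigned char *p)
    {
        uint64_t h = 1469598103934665603ULL;
        size_t i;

        for (i = 0; i < PAGE_SIZE; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Count how many of the npages pages starting at base duplicate an
     * earlier page, i.e. how many could in principle be shared. */
    size_t count_shareable(const unsigned char *base, size_t npages)
    {
        size_t i, j, shared = 0;

        for (i = 0; i < npages; i++) {
            const unsigned char *a = base + i * PAGE_SIZE;
            uint64_t ha = page_hash(a);

            for (j = 0; j < i; j++) {
                const unsigned char *b = base + j * PAGE_SIZE;

                /* Cheap hash check first, full compare to rule out
                 * collisions -- only then would the pages be merged. */
                if (ha == page_hash(b) && memcmp(a, b, PAGE_SIZE) == 0) {
                    shared++;
                    break;
                }
            }
        }
        return shared;
    }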

Live migration does not depend on the guest, but instead on the availability of free memory on other hosts in the virtualization pool; if other hosts do not have free memory, you cannot migrate to them. In addition, live migration takes time, which the host may not have when facing a memory shortage.

So kvm uses a mixed strategy: page sharing and ballooning are used as the preferred methods for memory overcommit since they are efficient. Live migration is used for long-term balancing of memory requirements and resources. Swapping is used as a last resort in order to guarantee that services do not fail.
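
Expressed as a purely illustrative policy sketch (none of these names exist in kvm; the point is only the preference order just described), the host's choice might look something like this:

    /* Illustrative overcommit policy, not kvm code: prefer the cheap
     * mechanisms, use migration for long-term rebalancing, and keep
     * swapping as the guaranteed last resort. */
    enum overcommit_action {
        ACTION_NONE,
        ACTION_SHARE_AND_BALLOON,   /* efficient, first choice      */
        ACTION_LIVE_MIGRATE,        /* slow, needs a host with room */
        ACTION_SWAP,                /* last resort, always works    */
    };

    enum overcommit_action pick_action(unsigned long free_pages,
                                       unsigned long low_watermark,
                                       unsigned long min_watermark,
                                       int peer_host_has_room)
    {
        if (free_pages >= low_watermark)
            return ACTION_NONE;                /* no memory pressure    */

        if (free_pages >= min_watermark)
            return ACTION_SHARE_AND_BALLOON;   /* mild pressure         */

        if (peer_host_has_room)
            return ACTION_LIVE_MIGRATE;        /* sustained shortage    */

        return ACTION_SWAP;                    /* urgent, cannot refuse */
    }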

Thursday, April 10, 2008

Paravirtualization is dead

Well, not all paravirtualization. I/O device virtualization is certainly the best way to get good I/O performance out of virtual machines, and paravirtualized clocks are still necessary to avoid clock-drift issues.

But mmu paravirtualization, hacking your guest operating system's memory management to cooperate with the hypervisor, is going away. The combination of hardware paging (NPT/EPT) and large pages matches or beats paravirtualization on most workloads. Talking to a hypervisor is simply more expensive than letting the hardware handle everything transparently, even before taking into account the costs introduced by paravirtualization, like slower system calls.

The design of the kvm paravirtualized mmu reflects this planned obsolescence. Instead of an all-or-nothing approach, kvm paravirtualization is divided into a set of features, each of which can be enabled or disabled independently. The guest picks the features it supports and starts using them.

The trick is that when the host supports NPT or EPT, kvm does not expose the paravirtualized mmu to the guest; in turn the guest doesn't use these features, and receives the benefit of the more powerful hardware. All this is done transparently without any user intervention.
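
The discovery mechanism is just cpuid: the guest probes the kvm signature leaf, then reads a feature bitmap and only uses the bits that are set. A rough guest-side sketch follows; the leaf numbers match the kvm paravirtualization interface, but treat the exact mmu feature bit as illustrative rather than authoritative:

    #include <stdint.h>
    #include <string.h>

    #define KVM_CPUID_SIGNATURE  0x40000000
    #define KVM_CPUID_FEATURES   0x40000001
    #define KVM_FEATURE_MMU_OP   2   /* bit number, per kvm_para.h */

    static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                      uint32_t *c, uint32_t *d)
    {
        __asm__ __volatile__("cpuid"
                             : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                             : "a"(leaf));
    }

    /* Returns nonzero if the host advertises the paravirtualized mmu.
     * On NPT/EPT hosts kvm leaves the bit clear, so the guest silently
     * falls back to ordinary hardware paging. */
    int kvm_pv_mmu_available(void)
    {
        uint32_t eax, ebx, ecx, edx;
        char sig[13];

        cpuid(KVM_CPUID_SIGNATURE, &eax, &ebx, &ecx, &edx);
        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        sig[12] = '\0';
        if (strcmp(sig, "KVMKVMKVM") != 0)
            return 0;                          /* not running on kvm */

        cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
        return (eax >> KVM_FEATURE_MMU_OP) & 1;
    }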