Wednesday, December 24, 2008

kvm userspace merging into upstream qemu

Recently, Anthony Liguori, one of the qemu maintainers, merged kvm support into stock Qemu. This is tremendously important.

Why, you might ask? It has to do with how software forks are managed.

When a software project is forked, there are two ways to go about it. One can add new features, restructuring code along the way so that the new code fits in snugly. This allows you to easily make large changes, but has the side effect of diverging from the original code. Over time, it is no longer possible (or at least very difficult) to incorporate fixes and new features that evolved in the original code, since the two code bases are wildly different.

An alternative strategy is to add the new features in a way that makes as little impact as possible on the original code. This allows updating from the origin to pick up fixes and new features relatively frequently. The downside is that we become severely limited in the kinds of changes we can make to our copy of Qemu without diverging too much.

We have mostly followed the second strategy. Adaptations to qemu were as small as possible, and we have "encouraged" non-kvm-specific changes to be contributed directly to qemu upstream. This kept the amount and scope of local modifications at a minimum.

But now that kvm has been merged, it is possible to make larger modifications to qemu in order to make it fit virtualization roles better. Live migration and virtio have already been merged. Device and cpu hotplug are on the queue. Deeper changes, like modifying how qemu manages memory and performs DMA, are pending. And, of course, kvm integration is much cleaner and more maintainable.

There is of course some friction involved. The new implementation has a few bugs and several missing features (for example, support for true SMP and Windows patching), so it will be rough for a while. However, once the transition is complete, kvm and qemu will be able to evolve at a faster pace, to the benefit of both.

Tuesday, September 2, 2008

Nested svm virtualization for kvm

Yesterday I found a nice surprise in my inbox: a post by Alex Graf adding support for virtualizing AMD's SVM instruction set when running KVM on AMD SVM.

What does this mean? Up until now, when kvm virtualizes a processor, the guest sees a cpu that is similar to the host processor but does not have virtualization extensions. This means that you cannot run a hypervisor that needs these extensions within a guest (you can still run hypervisors that do not rely on them, such as VMware, but with lower performance). With the new patches, the virtualized cpu does include the virtualization extensions; this means the guest can run a hypervisor, including kvm, and have its own guests.

There are two uses that immediately spring to mind: debugging hypervisors and embedded hypervisors. Obviously, having svm enabled in a guest means that one can debug a hypervisor in a guest, which is a lot easier than debugging on bare metal. The other use is to have a hypervisor running in the firmware at all times; up until now this meant you couldn't run another hypervisor on such a machine. With nested virtualization, you can.

The reason the post surprised me was the relative simplicity in which nested virtualization was implemented: less than a thousand lines of code. This is due to the clever design of the svm instruction set, and the ingenuity of the implementers (Alex Graf and Jörg Rödel) in exploiting the instruction set and meshing the implementation so well with the existing kvm code.
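
A toy model may help illustrate the core trick (the class and field names below are invented for illustration, not KVM's actual data structures): when the guest hypervisor (L1) executes VMRUN, the host hypervisor (L0) intercepts it, merges L1's VMCB intercept controls with its own, and runs the nested guest (L2) on the real hardware. On a #VMEXIT, L0 decides whether to handle the exit itself or reflect it into L1.

```python
# Hypothetical sketch of nested-SVM intercept merging; not the real
# KVM implementation.

class VMCB:
    """Minimal stand-in for AMD's Virtual Machine Control Block."""
    def __init__(self, intercepts, guest_state=None):
        self.intercepts = set(intercepts)   # events that force a #VMEXIT
        self.guest_state = guest_state      # registers to load on VMRUN

def merged_vmcb(l0_vmcb, l1_vmcb):
    # L0 must see every exit it cares about, plus every exit L1 asked
    # for -- so the merged intercept set is the union of the two.
    return VMCB(l0_vmcb.intercepts | l1_vmcb.intercepts,
                l1_vmcb.guest_state)

def handle_exit(l0_vmcb, l1_vmcb, exit_reason):
    # If L1 requested this intercept, reflect the exit into L1;
    # otherwise L0 handles it itself and resumes L2 directly.
    if exit_reason in l1_vmcb.intercepts:
        return "reflect to L1"
    return "handled by L0"

l0 = VMCB({"IOIO", "NPF"})
l1 = VMCB({"CPUID"}, guest_state={"rip": 0x1000})

run = merged_vmcb(l0, l1)
print(sorted(run.intercepts))          # ['CPUID', 'IOIO', 'NPF']
print(handle_exit(l0, l1, "CPUID"))    # reflect to L1
print(handle_exit(l0, l1, "IOIO"))     # handled by L0
```

Most of the real complexity is in merging and copying VMCB state correctly; the union-of-intercepts idea above is why the implementation can stay so compact.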

Tuesday, May 13, 2008

How kvm does security

Like most software, kvm does security in layers.

At the inner privilege layer is the kvm module. This code interacts directly with the guest and also has full access to the machine. If it is breached, a guest could potentially take over the host and any virtual machines running on it.

The outer privilege layer is qemu. While it is much larger than the kvm kernel module, it is relatively easy to contain a qemu breach so that it doesn't affect the rest of the host:
  • The kernel already protects itself from non-root user processes; if you run kvm as an unprivileged user, the kernel will not let you harm it.
  • Processes that run as different users are also restricted; so if you run each guest under a distinct user ID, more isolation is gained.
  • Mandatory access control systems such as selinux can be used to further restrict the damage that a breached qemu can inflict.
What are the most vulnerable submodules in kvm?
  • Probably the most critical piece is the x86 instruction emulator, which is invoked whenever the guest accesses I/O registers or its page tables. This code weighs in at about 2000 lines.
  • If the kvm mmu can be tricked into mapping an arbitrary host page into guest memory, then the guest can potentially insert its own code into the kernel. The mmu is about 3000 lines in length, but it has been the subject of endless inspection, so it is likely a very difficult target.
So again the "reuse Linux" theme repeats: kvm leverages the existing Linux kernel both to reduce the attack surface presented to malicious guests, and also to contain the damage should a security breach occur.

Friday, May 2, 2008

Comparing code size

Starting with Linux 2.6.26, kvm supports four different machine architectures: x86, s390 (System Z, or mainframes), ia64 (Intel's Itanium), and embedded PowerPC processors. It is interesting to compare the size of the code supporting each architecture:


x86 is old and crufty; it supports three instruction sets and four paging modes; its long and successful history means that it needs the most kvm support code. There are two different virtualization extensions that kvm supports on x86 (Intel's VT and AMD's SVM). It is also the architecture that has been supported by kvm for the longest time. It is no surprise that it leads the pack by a significant amount.

ia64 is a newer architecture, but a quite complex one. The mechanism by which it supports virtualization, with a module loaded into the host kernel and a second module loaded into the guest address space, also adds complexity. So it comes in second, though far behind x86.

s390 is older (and probably far cruftier) than x86. But on the other hand, its hardware virtualization support is so mature and complete that a complete hypervisor fits in a fraction of the lines required for x86. Indeed, it will take a while until x86 can support 64-way guests.

ppc 44x, the embedded PowerPC variant targeted by kvm, has a simple software-managed TLB model and the regular instruction-set encoding favored by RISC processors, so it gets by with just a seventh of the amount of code required by x86.

As we add more features, kvm code size will continue to grow slowly, but the relative comparison will no doubt remain valid. And kvm will likely remain the smallest full virtualization solution available.

Sunday, April 27, 2008

KVM Forum 2008 Agenda posted

The near-final agenda for the KVM Forum 2008 has been posted! I'm pleased to see a well-rounded set of presentations, covering all aspects of kvm development.

If you're interested in kvm development, and haven't already, make sure to register now.

See you all in Napa!

Friday, April 25, 2008

I/O: Maintainability vs Performance

I/O performance is of great importance to a hypervisor. I/O is also a huge maintenance burden, due to the large number of hardware devices that need to be supported, numerous I/O protocols, high availability options, and management for it all.

VMware opted for the performance option, by putting the I/O stack in the hypervisor. Unfortunately the VMware kernel is proprietary, so VMware has to write and maintain the entire I/O stack. That means a slow development rate, and that your hardware may take a while to be supported.

Xen took the maintainability route, by doing all I/O within a Linux guest, called "domain 0". By reusing Linux for I/O, the Xen maintainers don't have to write an entire I/O stack. Unfortunately, this eats away performance: every interrupt has to go through the Xen scheduler so that Xen can switch to domain 0, and everything has to go through an additional layer of mapping.

Not that Xen solved the maintainability problem completely: the Xen domain 0 kernel is still stuck on the ancient Linux 2.6.18 release (whereas 2.6.25 is now available). These problems have led Fedora 9 to drop support for hosting Xen guests, leaving kvm as the sole hypervisor.

So how does kvm fare here? Like VMware, kvm does I/O within the hypervisor context, so full performance is retained. Like Xen, it reuses the entire Linux I/O stack, so kvm users enjoy the latest drivers and I/O stack improvements. Who said you can't have your cake and eat it?

Tuesday, April 15, 2008

Memory overcommit with kvm

kvm supports (or rather, will support; this is work in progress) several ways of running guests with more memory than you have on the host:

Swapping
This is the classical way to support overcommit; the host picks some memory pages from one of the guests and writes them out to disk, freeing the memory for use. Should a guest require memory that has been swapped, the host reads it back from the disk.

Ballooning
With ballooning, the guest and host cooperate on which page is evicted. It is the guest's responsibility to pick the page and swap it out if necessary.

Page sharing
The hypervisor looks for memory pages that have identical data; these pages are all merged into a single page, which is marked read-only. If a guest writes to a shared page, it is unshared before granting the guest write access.

Live migration
The hypervisor moves one or more guests to a different host, freeing the memory used by these guests.

Why does kvm need four ways of overcommitting memory? Each method provides different reliability/performance tradeoffs.

Ballooning is fairly efficient since it relies on the guest to pick the memory to be evicted. Many times the guest can simply shrink its cache in order to free memory, which can have a very low guest impact. The problem with ballooning is that it relies on guest cooperation, which reduces its reliability.
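
A toy model (with invented names, not the actual virtio balloon protocol) shows why guest cooperation makes ballooning cheap: the guest can satisfy an inflate request largely by dropping clean page-cache pages, which cost almost nothing to give up.

```python
# Hypothetical sketch of balloon inflation from the guest's side.

class Guest:
    def __init__(self, total_pages, cache_pages):
        self.total = total_pages
        self.cache = cache_pages    # freeable page-cache pages
        self.ballooned = 0          # pages handed back to the host

    def inflate_balloon(self, request):
        # Drop clean cache pages first -- cheap for the guest -- and
        # only then give up memory that would need to be swapped.
        from_cache = min(request, self.cache)
        self.cache -= from_cache
        self.ballooned += request
        return request              # pages the host can now reuse

g = Guest(total_pages=1000, cache_pages=400)
freed = g.inflate_balloon(300)
print(freed, g.cache)   # 300 pages freed; cache shrunk to 100
```

The unreliability mentioned above also falls out of this model: if the guest ignores the inflate request, the host gets nothing back.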

Swapping does not depend on the guest at all, so it is completely reliable from the host's point of view. However, the host has less knowledge than the guest about the guest's memory, so swapping is less performant than ballooning.

Page sharing relies on the guest behavior indirectly. As long as guests run similar applications, the host will achieve a high share ratio. But if a guest starts running new applications, the share ratio will decrease and free memory in the host will drop.
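
The mechanics of page sharing can be sketched as follows (the hashing scheme and class are illustrative, not kvm's implementation): pages with identical contents are found by hashing, merged into one canonical copy, and unshared again on the first write.

```python
# Hypothetical sketch of content-based page sharing with
# copy-on-write unsharing.

import hashlib

class SharedMemory:
    def __init__(self):
        self.pages = {}     # page id -> content (bytes)
        self.shared = {}    # content hash -> canonical page id

    def insert(self, pid, content):
        h = hashlib.sha256(content).hexdigest()
        if h in self.shared:
            # Duplicate content: point this page at the canonical copy.
            self.pages[pid] = self.pages[self.shared[h]]
        else:
            self.shared[h] = pid
            self.pages[pid] = content

    def write(self, pid, content):
        # Copy-on-write: the page gets its own private copy,
        # leaving the shared copy untouched.
        self.pages[pid] = content

mem = SharedMemory()
mem.insert("a", b"\x00" * 4096)
mem.insert("b", b"\x00" * 4096)
print(mem.pages["a"] is mem.pages["b"])    # True: one physical copy
mem.write("b", b"\x01" + b"\x00" * 4095)   # unshare on first write
print(mem.pages["a"] is mem.pages["b"])    # False
```

A real implementation must compare page contents rather than trust the hash alone, and must walk guest memory incrementally; the sketch only shows the share/unshare life cycle.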

Live migration does not depend on the guest, but instead on the availability of free memory on other hosts in the virtualization pool; if other hosts do not have free space, you cannot migrate to them. In addition, live migration takes time, which the host may not have when facing a memory shortage.

So kvm uses a mixed strategy: page sharing and ballooning are used as the preferred methods for memory overcommit since they are efficient. Live migration is used for long-term balancing of memory requirements and resources. Swapping is used as a last resort in order to guarantee that services do not fail.

Thursday, April 10, 2008

Paravirtualization is dead

Well, not all paravirtualization. I/O device virtualization is certainly the best way to get good I/O performance out of virtual machines, and paravirtualized clocks are still necessary to avoid clock-drift issues.

But mmu paravirtualization, hacking your guest operating system's memory management to cooperate with the hypervisor, is going away. The combination of hardware paging (NPT/EPT) and large pages matches or beats paravirtualization on most workloads. Talking to a hypervisor is simply more expensive than letting the hardware handle everything transparently, even before taking into account the costs introduced by paravirtualization, like slower system calls.

The design of the kvm paravirtualized mmu reflects this planned obsolescence. Instead of an all or nothing approach, kvm paravirtualization is divided into a set of features which can be enabled or disabled independently. The guest picks the features it supports and starts using them.

The trick is that when the host supports NPT or EPT, kvm does not expose the paravirtualized mmu to the guest; in turn the guest doesn't use these features, and receives the benefit of the more powerful hardware. All this is done transparently without any user intervention.
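
The negotiation can be sketched like this (the flag names are made up for illustration): the host advertises a bitmask of paravirt features, the guest enables only the ones it understands, and on NPT/EPT hardware the host simply never advertises the paravirt mmu.

```python
# Hypothetical sketch of independent paravirt feature negotiation.

PV_MMU    = 1 << 0   # paravirtualized page-table updates
PV_CLOCK  = 1 << 1   # paravirtualized clocksource
PV_UNHALT = 1 << 2   # some other independent feature

def host_features(has_npt):
    features = PV_CLOCK | PV_UNHALT
    if not has_npt:
        features |= PV_MMU   # only worth offering without hardware paging
    return features

def guest_enable(advertised, supported):
    # The guest uses the intersection of what the host offers and
    # what it knows how to drive.
    return advertised & supported

# Older host without NPT: the guest ends up using the paravirt mmu.
print(guest_enable(host_features(False), PV_MMU | PV_CLOCK))   # 3
# NPT-capable host: the PV mmu is never offered; the same guest
# transparently falls back to hardware paging.
print(guest_enable(host_features(True), PV_MMU | PV_CLOCK))    # 2
```

Because each feature bit is independent, obsoleting the paravirt mmu requires no guest-side change at all, which is the planned obsolescence described above.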

Friday, March 28, 2008

True myths

The appearance of kvm naturally provoked reactions from the competition, which are interesting in the way they imply some untruths while being 100% accurate:

  • kvm is good for the desktop -- that is eminently true; by being integrated with Linux, kvm inherits all the desktop and laptop goodies, like excellent power management, suspend/resume, good regular (non-virtual-machine) process performance, and driver integration.

    The implication, however, is that kvm is not suitable for server use. This is wrong: kvm also inherits from Linux its server qualities, including excellent scalability, advanced memory management, security, and I/O stack.

  • you need a bare metal hypervisor for server workloads -- that is also true; without complete control of the hardware, a hypervisor will be hopelessly inefficient.

    Somehow the people who say this ignore the fact that kvm is a bare metal hypervisor, accessing the hardware directly. In fact kvm is much closer to the bare metal than Xen, which can only access I/O devices through a special guest, "dom0", which is definitely not running on bare metal.

  • A thin hypervisor gives better security -- true again; the smaller your trusted computing base is, the greater confidence you have in your hypervisor.

    The same speakers then go on about how thin Xen is. But they seem to ignore that the entire I/O and management plane is in fact a Linux guest -- and that it is part of the trusted computing base. Now which is smaller, Linux, or Xen with a trusted Linux guest?

Developers, of course, realize all of this immediately; but it will take some time and counter-marketing to repair the damage already done. Hence this article.