Avi Kivity's blog

Thursday, August 25, 2011

C, assembly, and security

Let's look at the innocent C statement:

    a = b + c;

What could possible go wrong? Let us list the ways:

a, b, or c are the not the variables we want
We specified addition, but we wanted something else
The addition operation overflows
a and b are unsigned, while c is signed and negative; the result becomes unsigned
a has a smaller size than b or c; the assignment operation overflows
a is signed while b or c are unsigned; a becomes negative while b+c are positive
a is unsigned, while b and c are signed; again we have an overflow
We used the wrong indentation level and people are unhappy about it

We can't expect the language to prevent all of these errors, but can we make C safer by trapping some of them at least during runtime? Turns out we can't do that without sacrificing performance:

to handle (3), we need trapping signed addition and trapping unsigned addition instructions
to handle (4), we need a mixed signedness trapping add
to handle (5), we need a trapping store unsigned and trapping store signed instructions, which check that the value in the register fits into the memory location specified
ditto with (6) and (7)

these (and similar) issues show up regularly in security vulnerabilities; it's hard to fix them because the necessary processor instructions are not there. We could emulate them by using sequences of existing instructions, but that would bloat the code and hurt performance; since performance is something that can be benchmarked but security is not, we end up with exploitable code.

So why are those instructions missing? In the 70's and 80's when the industry was ramping up performance was a much greater concern than security. Code was smaller and easier to audit; CPU cycles were longer and therefore more important to conserve; networks were small and private; truly malicious attacks were rare.

An unvirtuous cycle followed: C tried to make the most of exising processors, so its semantics mimic the instruction set of those days. It then became wildly successful, so processors were optimized for running C code; naturally they implemented or optimized instructions which directly translated to C concepts. This made C even more popular.

A pair of examples from the x86 world are the INTO and BOUND instructions. INTO (INTerrupt on Overflow) can follow an addition or subtraction instruction, effectively turning it into a trapping signed instruction. BOUND performs an array subscript bounds checking, trapping if the index is out of bounds. But the first implementations were rarely used, so they were not optimized in later iterations of the processor. Finally, the 64-bit extensions to the x86 instruction set removed those two instructions for good.

Sunday, September 6, 2009

Nested vmx support coming to kvm

Almost exactly a year ago I reported on nested svm for kvm - a way to run hypervisors as kvm guests, on AMD hosts. I'm happy to follow up with the corresponding feature for Intel hosts - nested vmx.

Unlike the nested svm patchset, which was relatively simple, nested vmx is relatively complex. This is due to several reasons:

While svm uses a memory region to communicate between hypervisor and processor, vmx uses special instructions -- VMREAD and VMWRITE. kvm must trap and emulate the effect of these instructions, instead of allowing the guest to read and write as it pleases.
vmx is significantly more complex than svm: vmx uses 144 fields for hypervisor-to-processor communications, while svm gets along with just 91. All of those fields have to be virtualized. Note that nested virtualization must reconcile the way kvm uses those fields with the way its guest (which is also a hypervisor) uses those fields; this causes complexity to increase even more.
The nested vmx patchset implements support for Extended Page Tables (EPT) in the guest hypervisor, in addition to existing support in the host. This means that kvm must now support guest pagetables in the 32-bit format, 64-bit format, and now the EPT format.

Support for EPT in the guest deserves special mention, since it is critical for obtaining reasonable performance. Without nested EPT, the guest hypervisor will have to trap writes to guest page tables and context switches. The the guest hypervisor has to service those traps - by issuing the VMREAD and VMWRITE to communicate with the processor. Since those instructions must trap to kvm, any trap taken by the guest is multiplied by quite a large factor into kvm traps.

So how does nested EPT work?

Without nesting, EPT provides for two levels of address translation:

The first level is managed by the guest, and translates guest virtual addresses (gva) to guest physical addresses (gpa).
The second address translation level translates guest physical addresses into host physical adresses (hpa). This second level is managed by the host (kvm).

When nesting is introduced, we now have three levels of address translation:

Nested guest virtual address (ngva) to nested guest physical address (ngpa) (managed by the nested guest)
Nested guest physical address (ngpa) to guest physical address (gpa) (managed by the guest hypervisor)
Guest physical address (gpa) to host physical address (hpa) (managed by the host - kvm)

Given that the hardware only supports two levels of address translation, we need to invoke software wizardry. Fortunately, we already have code in kvm that can fold two levels of address translation into one - the shadow mmu.

The shadow mmu, which is used when EPT or NPT are not available, folds the gva→gpa→hpa translation into a single gva→hpa translation which is supported by hardware. We can reuse this code to fold the ngpa→gpa→hpa translation into a single ngpa→hpa. Since the hardware supports two levels, it will happily translate ngva→ngpa→hpa.

But what about performance? Weren't NPT and EPT introduced to solve performance problems with the shadow mmu? Shadow mmu performance depends heavily on the rate of change of the two translation levels folded together. Virtual address translations (gva→gpa or ngva→ngpa) do change very frequently, but physical address translations (ngpa→gpa or gpa→hpa) change only rarely, usually in response to a guest starting up or swapping activity. So, while the code is complex and relatively expensive, it will only be invoked rarely.

To summarize, nested vmx looks to be one of the most complicated features in kvm, especially if we wish to maintain reasonable performance. It is expected that it will take Orit Wasserman and the rest of the IBM team some time to mature this code, but once this work is complete, kvm users will be able to enjoy another unique kvm feature.

Wednesday, December 24, 2008

kvm userspace merging into upstream qemu

Recently, Anthony Liguori, one of the qemu maintainers has included kvm support into stock Qemu. This is tremendously important.

Why? you might ask. It has to do with how software forks are managed.

When a software project is forked, there are two ways to go about it. One can add new features, restructuring code along the way so that the new code fits in snugly. This allows you to easily make large changes, but has the side effect of diverging from the original code. Over time, it is no longer possible (or at least very difficult) to incorporate fixes and new features that evolved in the original code, since the two code bases are wildly different.

An alternative strategy is to add the new features in a way that makes as little impact as possible on the original code. This allows updating from the origin to pick up fixes and new features relatively frequently. The downside is that we become severely limited in the kind of changes we can make to our copy of Qemu without diverging too much.

We have mostly followed the second strategy. Adaptations to qemu were as small as possible, and we have "encouraged" non-kvm-specific changes to be contributed directly to qemu upstream. This kept the amount and scope of local modifications at a minimum.

But now that kvm has been merged, it is possible to make larger modifications to qemu in order to make it fit virtualization roles better. Live migration and virtio have already been merged. Device and cpu hotplug are on the queue. Deeper changes, like modifying how qemu manages memory and performs DMA, are pending. And, of course, kvm integration is much cleaner and more maintainable.

There is of course some friction involved. The new implementation has a few bugs and several missing features (for example, support for true SMP and Windows patching), so it will be rough for a while. However, once the transition is complete, kvm and qemu will be able to evolve at a faster pace, to the benefit of both.

Tuesday, September 2, 2008

Nested svm virtualization for kvm

Yesterday I found a nice surprise in my inbox - a post, by Alex Graf, adding support for virtualizing AMD's SVM instruction set when running KVM on AMD SVM.

What does this mean? up until now, when kvm virtualizes a processor, the guest sees a cpu that is similar to the host processor, but does not have virtualization extensions. This means that you cannot run a hypervisor that needs these virtualization extensions within a guest (you can still run hypervisors that do not rely on these extensions, such as VMware, but with lower performance). With the new patches, the virtualized cpu does include the virtualization extensions; this means the guest can run a hypervisor, including kvm, and have its own guests.

There are two uses that immediately spring to mind: debugging hypervisors and embedded hypervisors. Obviously having svm enabled in a guest means that one can debug a hypervisor in a guest, which is a lot easier that debugging on bare metal. The other use is to have a hypervisor running in the firmware at all times; up until now this meant you couldn't run another hypervisor on such a machine. With nested virtualization, you can.

The reason the post surprised me was the relative simplicity in which nested virtualization was implemented: less than a thousand lines of code. This is due to the clever design of the svm instruction set, and the ingenuity of the implementers (Alex Graf and Jörg Rödel) in exploiting the instruction set and meshing the implementation so well with the existing kvm code.

Tuesday, May 13, 2008

How kvm does security

Like most software, kvm does security in layers.

At the inner privilege layer is the kvm module. This code interacts directly with the guest and also has full access to the machine. If breached, a guest could potentially take over the host and any virtual machines running on it.

The outer privilege layer is qemu. While it is much larger than the kvm kernel module, it is relatively easy to contain a qemu breach so that it doesn't affect the rest of the host:

The kernel already protects itself from non-root user processes; if you run kvm as an unprivileged user, the kernel will not let you harm it.
Processes that run as different users are also restricted; so if you run each guest under a distinct user ID, more isolation is gained.
Mandatory access control systems such as selinux can be used to further restrict the damage that a breached qemu can inflict.

What are the most vulnerable submodules in kvm?

Probably the most critical piece is the x86 instruction emulator, which is invoked whenever the guest accesses I/O registers or the its page tables. This code weighs in at about 2000 lines.
If the kvm mmu can be tricked into mapping an arbitrary host page into guest memory, then the guest can potentially insert its own code into the kernel. The mmu is about 3000 lines in length, but it has been the subject of endless inspection, so it is likely a very difficult target.

So again the "reuse Linux" theme repeats: kvm leverages the existing Linux kernel both to reduce the attack surface presented to malicious guests, and also to contain the damage should a security breach occur.

Friday, May 2, 2008

Comparing code size

Starting with Linux 2.6.26, kvm supports four different machine architectures: x86, s390 (System Z, or mainframes), ia64 (Intel's Itanium), and embedded PowerPC processors. It is interesting to compare the size of the code supporting each architecture:

arch	lines
x86	17442
ia64	8154
s390	2509
ppc	2229

x86 is old and crufty; it supports three instruction sets and four paging modes; its long and successful history means that it needs the most kvm support code. There are two different virtualization extensions that kvm supports on x86 (Intel's VT and AMD's SVM). It is also the architecture that has been supported by kvm for the longest time. It is no surprise that it leads the pack by a significant amount.

ia64 is a newer architecture, but a quite complex one. The mechanism by which is supports virtualization, with a module loaded into the host kernel and a second module loaded into the guest address space, also adds complexity. So it comes in second, though far behind x86.

s390 is older (and probably far cruftier) than x86. But on the other hand, its hardware virtualization support is so mature and complete that a complete hypervisor fits in a fraction of the lines required for x86. Indeed, it will take a while until x86 can support 64-way guests.

ppc 44x, the embedded PowerPC variant targeted by kvm, has a simple software-managed tlb model, and the regular instruction set encoding favored by RISC processors, so it gets by with just a seventh of the amount of code required by x86.

As we add more features, kvm code size will continue to grow slowly, but the relative comparison will no doubt remain valid. And kvm will likely remain the smallest full virtualization solution available.

Sunday, April 27, 2008

KVM Forum 2008 Agenda posted

The near-final agenda for the KVM Forum 2008 has been posted! I'm pleased to see a well-rounded set of presentations, covering all aspects of kvm development.

If you're interested in kvm development, and haven't already, make sure to register now.

See you all in Napa!