Sunday, September 6, 2009

Nested vmx support coming to kvm

Almost exactly a year ago I reported on nested svm for kvm - a way to run hypervisors as kvm guests, on AMD hosts. I'm happy to follow up with the corresponding feature for Intel hosts - nested vmx.

Unlike the nested svm patchset, which was relatively simple, nested vmx is relatively complex. This is due to several reasons:

  • While svm uses a memory region to communicate between hypervisor and processor, vmx uses special instructions -- VMREAD and VMWRITE. kvm must trap and emulate the effect of these instructions, instead of allowing the guest to read and write as it pleases.
  • vmx is significantly more complex than svm: vmx uses 144 fields for hypervisor-to-processor communications, while svm gets along with just 91. All of those fields have to be virtualized. Note that nested virtualization must reconcile the way kvm uses those fields with the way its guest (which is also a hypervisor) uses those fields; this causes complexity to increase even more.
  • The nested vmx patchset implements support for Extended Page Tables (EPT) in the guest hypervisor, in addition to existing support in the host. This means that kvm must now support guest pagetables in the 32-bit format, 64-bit format, and now the EPT format.
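The trap-and-emulate idea from the first bullet can be sketched as a toy model. This is not kvm code: `ToyHost`, `emulate_vmread`, and `emulate_vmwrite` are hypothetical names, and a plain dictionary stands in for the control structure that the guest hypervisor believes it is programming.

```python
class ToyHost:
    """Plays the role of kvm in this sketch (hypothetical, not the kvm API)."""

    def __init__(self):
        self.exits = 0          # how many times the guest hypervisor trapped to us
        self.virtual_vmcs = {}  # software copy of the fields the guest hypervisor set

    def emulate_vmwrite(self, field, value):
        # The guest hypervisor executed VMWRITE; the host records the value
        # instead of letting it reach the real hardware state.
        self.exits += 1
        self.virtual_vmcs[field] = value

    def emulate_vmread(self, field):
        # The guest hypervisor executed VMREAD; the host answers from its copy.
        self.exits += 1
        return self.virtual_vmcs.get(field, 0)


host = ToyHost()
host.emulate_vmwrite("GUEST_RIP", 0x1000)
rip = host.emulate_vmread("GUEST_RIP")
print(hex(rip), host.exits)
```

Every field access costs one trap in this model, which is why the number of fields and the frequency of access matter so much.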

Support for EPT in the guest deserves special mention, since it is critical for obtaining reasonable performance. Without nested EPT, the guest hypervisor has to trap writes to guest page tables and context switches. The guest hypervisor then has to service those traps by issuing VMREAD and VMWRITE instructions to communicate with the processor. Since those instructions must themselves trap to kvm, any trap taken by the nested guest is multiplied by quite a large factor into kvm traps.
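The multiplication effect can be put into rough numbers. The sketch below is a back-of-the-envelope model only: the function name and the access counts are assumptions chosen for illustration, not measurements of any hypervisor.

```python
def kvm_traps_per_nested_trap(vmcs_accesses):
    """Toy model: one trap taken by the nested guest, plus one kvm trap for
    every VMREAD/VMWRITE the guest hypervisor issues while servicing it.
    The access count is an illustrative assumption, not a measured value."""
    return 1 + vmcs_accesses


# Hypothetical workloads: how many kvm traps result from 1000 nested traps
# if the guest hypervisor touches 10 or 30 fields per trap.
light = 1000 * kvm_traps_per_nested_trap(10)
heavy = 1000 * kvm_traps_per_nested_trap(30)
print(light, heavy)
```

Reducing the number of traps the nested guest takes in the first place, which is what nested EPT does, shrinks the total far more than speeding up any single emulated access.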

So how does nested EPT work?

Without nesting, EPT provides for two levels of address translation:
  1. The first level is managed by the guest, and translates guest virtual addresses (gva) to guest physical addresses (gpa).
  2. The second address translation level translates guest physical addresses into host physical addresses (hpa). This second level is managed by the host (kvm).
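The two levels can be illustrated with a toy model in which each page table is a dictionary from page number to page number. Real page tables are multi-level structures walked by hardware; the flat dictionaries, the `translate` helper, and the page numbers here are simplifying assumptions.

```python
PAGE = 4096  # assume 4 KiB pages for the sketch


def translate(table, addr):
    """Map the page number through a dict-based toy table, keep the offset."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset


guest_pt = {0x10: 0x2}  # gva page -> gpa page, managed by the guest
ept = {0x2: 0x7F}       # gpa page -> hpa page, managed by the host (kvm)

gva = 0x10 * PAGE + 0x123
gpa = translate(guest_pt, gva)  # first level
hpa = translate(ept, gpa)       # second level
print(hex(gva), hex(gpa), hex(hpa))
```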

When nesting is introduced, we now have three levels of address translation:
  1. Nested guest virtual address (ngva) to nested guest physical address (ngpa) (managed by the nested guest)
  2. Nested guest physical address (ngpa) to guest physical address (gpa) (managed by the guest hypervisor)
  3. Guest physical address (gpa) to host physical address (hpa) (managed by the host - kvm)

Given that the hardware only supports two levels of address translation, we need to invoke software wizardry. Fortunately, we already have code in kvm that can fold two levels of address translation into one - the shadow mmu.

The shadow mmu, which is used when EPT or NPT are not available, folds the gva→gpa→hpa translation into a single gva→hpa translation which is supported by hardware. We can reuse this code to fold the ngpa→gpa→hpa translation into a single ngpa→hpa. Since the hardware supports two levels, it will happily translate ngva→ngpa→hpa.
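Folding is essentially function composition over page mappings. The sketch below uses the same kind of dictionary-based toy tables; `fold` and `translate` are hypothetical helpers, and the real shadow mmu builds its tables lazily, on faults, rather than composing whole tables up front.

```python
PAGE = 4096  # assume 4 KiB pages for the sketch


def translate(table, addr):
    """Map the page number through a dict-based toy table, keep the offset."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset


def fold(outer, inner):
    """Compose two page maps into one: folded[p] = inner[outer[p]]."""
    return {page: inner[mid] for page, mid in outer.items() if mid in inner}


nested_pt = {0x10: 0x5}  # ngva -> ngpa, managed by the nested guest
nested_ept = {0x5: 0x2}  # ngpa -> gpa, managed by the guest hypervisor
host_ept = {0x2: 0x7F}   # gpa -> hpa, managed by kvm

shadow_ept = fold(nested_ept, host_ept)  # ngpa -> hpa, built in software

ngva = 0x10 * PAGE + 0x123
# The hardware now only has to walk two levels: nested_pt, then shadow_ept.
hpa_two_level = translate(shadow_ept, translate(nested_pt, ngva))
# Reference result from walking all three levels one by one.
hpa_three_level = translate(host_ept, translate(nested_ept, translate(nested_pt, ngva)))
print(hex(hpa_two_level), hex(hpa_three_level))
```

Both walks arrive at the same host physical address, which is the property the folded table has to preserve.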

But what about performance? Weren't NPT and EPT introduced to solve performance problems with the shadow mmu? Shadow mmu performance depends heavily on the rate of change of the two translation levels folded together. Virtual address translations (gva→gpa or ngva→ngpa) do change very frequently, but physical address translations (ngpa→gpa or gpa→hpa) change only rarely, usually in response to a guest starting up or swapping activity. So, while the code is complex and relatively expensive, it will only be invoked rarely.
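The performance argument can be illustrated by counting how often the folded table has to be recomputed. In this toy model (the `ShadowEpt` class is hypothetical, and it refolds the whole table on every change for simplicity, which real code would not do), changes to the virtual level never touch the folded table, while a change to a physical level does.

```python
class ShadowEpt:
    """Toy folded ngpa -> hpa table that counts how often it is refolded."""

    def __init__(self, nested_ept, host_ept):
        self.nested_ept = dict(nested_ept)  # ngpa page -> gpa page
        self.host_ept = dict(host_ept)      # gpa page -> hpa page
        self.refolds = 0
        self._refold()

    def _refold(self):
        # Expensive software path: recompute the composed mapping.
        self.refolds += 1
        self.folded = {
            ngpa: self.host_ept[gpa]
            for ngpa, gpa in self.nested_ept.items()
            if gpa in self.host_ept
        }

    def set_nested_mapping(self, ngpa_page, gpa_page):
        # The guest hypervisor changed its ngpa -> gpa mapping (rare).
        self.nested_ept[ngpa_page] = gpa_page
        self._refold()


shadow = ShadowEpt({0x5: 0x2}, {0x2: 0x7F, 0x3: 0x80})

# Frequent event: the nested guest rewrites its own ngva -> ngpa table.
# That level is walked by hardware, so the folded table is not involved.
nested_pt = {}
for ngva_page in range(1000):
    nested_pt[ngva_page] = 0x5

# Rare event: the guest hypervisor remaps a nested guest physical page.
shadow.set_nested_mapping(0x5, 0x3)
print(shadow.refolds, shadow.folded)
```

A thousand virtual-level updates cost nothing here; only the initial build and the single physical-level change invoke the software path.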

To summarize, nested vmx looks to be one of the most complicated features in kvm, especially if we wish to maintain reasonable performance. It is expected that it will take Orit Wasserman and the rest of the IBM team some time to mature this code, but once this work is complete, kvm users will be able to enjoy another unique kvm feature.


Andre said...

Fantastic! I'm developing a malware analysis framework, and I'd really like to use hardware virtualization for maximum stealth and flexibility. Not being able to analyse malware in an isolated virtual machine environment would render this approach useless though. Thus, the possibility of using (Intel-based) nested virtualization in KVM would be HUGE - as far as I know, no other vmms (VirtualBox, VMWare...) can do this.

Is nested vmx a planned feature that will come more or less with certainty, or is it something you are only investigating tentatively? I wasn't sure about that after reading your blog entry.

And while I know this is almost impossible to say ;-): Could you give a rough timeframe when nested vmx will appear in KVM? Is it more a matter of months, or years?


Avi Kivity said...

Nested vmx is a planned feature. I can't really estimate how long it will take (especially as I am not the one doing the work), but it will take at least a few months due to the complexity involved.

Anonymous said...

This looks really fun!

It also looks like there are more steps to take. Does that mean more overhead to go with the added complexity?

bkelly said...

Nested virtualization would be a fantastic feature for me - I would love to see it delivered. What's the latest? I've not been able to find much documentation on the subject...

Thanks and keep up the good work!

Unknown said...

Is there any news on this subject? Is the nested vmx support coming soon or is there still much work involved?

Could you give some references to extra documentation/information about nesting in KVM and the difficulties with the vmx?

Anonymous said...

How hard would it be to leverage SVM/VMX nesting as a foundation for full trapping and emulation of these instructions on platforms (like Atom) that don't support them? I'd like to use KVM across the board, but currently a large percentage of the hardware that I use doesn't support these instructions.