VIRTUAL MACHINES AND VIRTUALIZATION OF CLUSTERS AND DATA CENTERS

3.3 VIRTUALIZATION OF CPU, MEMORY, AND I/O DEVICES

3.3.1 Hardware Support for Virtualization

3.3.2 CPU Virtualization

3.3.2.1 Hardware-Assisted CPU Virtualization

3.3.3 Memory Virtualization

3.3.4 I/O Virtualization

3.3.5 Virtualization in Multi-Core Processors

3.3.5.1 Physical versus Virtual Processor Cores

3.3.5.2 Virtual Hierarchy

3.3 VIRTUALIZATION OF CPU, MEMORY, AND I/O DEVICES

  • To support virtualization, processors such as the x86 employ a special running mode and instructions, known as hardware-assisted virtualization.
  • In this way, the VMM and guest OS run in different modes and all sensitive instructions of the guest OS and its applications are trapped in the VMM.
  • Mode switching, which saves and restores processor state, is completed by hardware.
  • For the x86 architecture, Intel and AMD have proprietary technologies for hardware-assisted virtualization.

3.3.1 Hardware Support for Virtualization

  • Modern operating systems and processors permit multiple processes to run simultaneously.
  • If there is no protection mechanism in a processor, all instructions from different processes will access the hardware directly and cause a system crash.
  • Therefore, all processors have at least two modes, user mode and supervisor mode, to ensure controlled access to critical hardware.
    • Instructions running in supervisor mode are called privileged instructions.
    • Other instructions are unprivileged instructions.
  • In a virtualized environment, it is more difficult to make OSes and applications run correctly because there are more layers in the machine stack.
  • Popular hardware virtualization products:
  • VMware Workstation
    • The VMware Workstation is a VM software suite for x86 and x86-64 computers.
    • This software suite allows users to set up multiple x86 and x86-64 virtual computers and to use one or more of these VMs simultaneously with the host operating system.
    • VMware Workstation uses the host-based virtualization approach.
  • Xen
    • Xen is a hypervisor for use in IA-32, x86-64, Itanium, and PowerPC 970 hosts.
    • Xen modifies Linux as the lowest and most privileged layer, or a hypervisor.
    • One or more guest OSes can run on top of the hypervisor.
  • KVM (Kernel-based Virtual Machine)
    • KVM is a Linux kernel virtualization infrastructure.
    • KVM can support
      • hardware-assisted virtualization (using Intel VT-x or AMD-V) and
      • para-virtualization (using the VirtIO framework); a quick check of host support for these hardware features is sketched below.
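
As a Linux-specific illustration (not part of KVM itself), the sketch below scans /proc/cpuinfo for the vmx (Intel VT-x) or svm (AMD-V) flags and checks whether the /dev/kvm device node exists.

```python
# Minimal sketch: detect hardware-assisted virtualization support on a Linux host.
# Linux-specific paths (/proc/cpuinfo, /dev/kvm); purely illustrative.
import os

def cpu_virtualization_flags(cpuinfo_path="/proc/cpuinfo"):
    """Return the virtualization-related CPU flags listed in /proc/cpuinfo."""
    wanted = {"vmx", "svm"}          # vmx = Intel VT-x, svm = AMD-V
    found = set()
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                found |= wanted & set(line.split(":", 1)[1].split())
    return found

if __name__ == "__main__":
    flags = cpu_virtualization_flags()
    print("Hardware-assisted virtualization flags:", flags or "none found")
    print("/dev/kvm present:", os.path.exists("/dev/kvm"))
```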

3.3.2 CPU Virtualization

  • A VM is a duplicate of an existing computer system.
  • The majority of VM instructions are executed on the host processor in native mode.
    • Unprivileged instructions of VMs run directly on the host machine for higher efficiency.
    • Other critical instructions should be handled carefully for correctness and stability.
  • The critical instructions are divided into three categories:
    • privileged instructions,
      • Privileged instructions execute in a privileged mode and will be trapped if executed outside this mode.
    • control sensitive instructions,
      • Control-sensitive instructions attempt to change the configuration of the resources used.
    • behavior-sensitive instructions.
      • Behavior-sensitive instructions have different behaviors depending on the configuration of resources, including the load and store operations over the virtual memory.
  • A CPU architecture is virtualizable if it supports the ability to run the VM’s privileged and unprivileged instructions in the CPU’s user mode while the VMM runs in supervisor mode.
  • When the privileged instructions including control- and behavior-sensitive instructions of a VM are executed, they are trapped in the VMM.
    • In this case, the VMM acts as a unified mediator for hardware access from different VMs to guarantee the correctness and stability of the whole system.
  • However, not all CPU architectures are virtualizable.
    • RISC CPU architectures can be naturally virtualized because all control-sensitive and behavior-sensitive instructions are privileged instructions.
    • On the contrary, x86 CPU architectures are not primarily designed to support virtualization.
      • This is because about 10 sensitive instructions, such as SGDT and SMSW, are not privileged instructions.
      • When these instructions execute in a virtualized environment, they are not trapped by the VMM.
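
To make the trap-and-emulate behavior above concrete, here is a toy dispatcher. The instruction names, the handler, and the string results are hypothetical stand-ins; a real VMM operates on binary instruction streams and hardware traps, not Python strings.

```python
# Toy model of trap-and-emulate CPU virtualization (illustrative only).
# Instruction names and the handler below are hypothetical stand-ins.

PRIVILEGED = {"HLT", "LGDT", "MOV_TO_CR3"}          # trap when executed in user mode
SENSITIVE_UNPRIVILEGED = {"SGDT", "SMSW"}           # the problematic x86 cases: no trap

def vmm_emulate(instr):
    return f"{instr}: trapped into the VMM and emulated"

def run_guest_instruction(instr):
    """Dispatch one guest instruction the way a trap-and-emulate VMM would."""
    if instr in PRIVILEGED:
        return vmm_emulate(instr)                   # hardware trap -> VMM emulates
    if instr in SENSITIVE_UNPRIVILEGED:
        # On classic x86 these execute silently with wrong results instead of trapping,
        # which is why x86 was not naturally virtualizable.
        return f"{instr}: executed without a trap (incorrect under classic x86)"
    return f"{instr}: executed directly on the host CPU"

if __name__ == "__main__":
    for i in ["ADD", "SGDT", "MOV_TO_CR3"]:
        print(run_guest_instruction(i))
```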

3.3.2.1 Hardware-Assisted CPU Virtualization

  • This technique attempts to simplify virtualization because full virtualization and para-virtualization are complicated.
  • Intel and AMD add an additional privilege mode (some people call it Ring -1) to x86 processors.
  • Therefore, operating systems can still run at Ring 0 and the hypervisor can run at Ring -1.
  • All the privileged and sensitive instructions are trapped in the hypervisor automatically.
  • This technique removes the difficulty of implementing binary translation of full virtualization.
  • It also lets the operating system run in VMs without modification.

3.3.3 Memory Virtualization

  • Virtual memory virtualization is similar to the virtual memory support provided by modern operating systems.
  • In a traditional execution environment, the operating system maintains mappings of virtual memory to machine memory using page tables, which is a one-stage mapping from virtual memory to machine memory.
  • All modern x86 CPUs include a memory management unit (MMU) and a translation lookaside buffer (TLB) to optimize virtual memory performance.
  • However, in a virtual execution environment, virtual memory virtualization involves sharing the physical system memory in RAM and dynamically allocating it to the physical memory of the VMs.
  • That means a two-stage mapping process should be maintained by the guest OS and the VMM, respectively:
    • virtual memory to physical memory and
    • physical memory to machine memory.
  • Furthermore, MMU virtualization should be supported, which is transparent to the guest OS.
  • The guest OS continues to control the mapping of virtual addresses to the physical memory addresses of VMs.
  • But the guest OS cannot directly access the actual machine memory.
  • The VMM is responsible for mapping the guest physical memory to the actual machine memory.
  • Figure 3.12 shows the two-level memory mapping procedure.
  • Since each page table of the guest OSes has a separate page table in the VMM corresponding to it, the VMM page table is called the shadow page table.
  • Nested page tables add another layer of indirection to virtual memory.
  • The MMU already handles virtual-to-physical translations as defined by the OS.
  • Then the physical memory addresses are translated to machine addresses using another set of page tables defined by the hypervisor.
  • Since modern operating systems maintain a set of page tables for every process, the shadow page tables will get flooded.
  • Consequently, the performance overhead and cost of memory will be very high.
  • VMware uses shadow page tables to perform virtual-memory-to-machine-memory address translation.
  • Processors use TLB hardware to map the virtual memory directly to the machine memory to avoid the two levels of translation on every access.
  • When the guest OS changes the virtual memory to a physical memory mapping, the VMM updates the shadow page tables to enable a direct lookup.
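
A minimal sketch of the two-stage mapping described above, using two dictionaries for the guest page table and the VMM's physical-to-machine table, and composing them into a shadow page table that allows a single virtual-to-machine lookup. All page numbers are invented for illustration.

```python
# Toy model of two-stage memory mapping and a shadow page table (illustrative only).
# All page numbers below are made up.

guest_page_table = {0x1: 0x10, 0x2: 0x11}   # guest virtual page -> guest physical page
vmm_p2m_table    = {0x10: 0xA0, 0x11: 0xA7} # guest physical page -> machine page

def translate_two_stage(gva_page):
    """Stage 1 (guest OS) then stage 2 (VMM): two lookups per access."""
    gpa_page = guest_page_table[gva_page]
    return vmm_p2m_table[gpa_page]

def build_shadow_page_table():
    """The VMM composes both stages so one lookup maps virtual pages to machine pages."""
    return {gva: vmm_p2m_table[gpa] for gva, gpa in guest_page_table.items()}

if __name__ == "__main__":
    shadow = build_shadow_page_table()
    for page in guest_page_table:
        assert translate_two_stage(page) == shadow[page]
    print("shadow page table:", {hex(k): hex(v) for k, v in shadow.items()})
```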

3.3.4 I/O Virtualization

  • I/O virtualization involves managing the routing of I/O requests between virtual devices and the shared physical hardware.
  • There are three ways to implement I/O virtualization:
    • Full device emulation,
    • Para-virtualization, and
    • Direct I/O.

Full Device Emulation

  • All the functions of a device or bus infrastructure, such as device enumeration, identification, interrupts, and DMA, are replicated in software.
  • This software is located in the VMM and acts as a virtual device.
  • The I/O access requests of the guest OS are trapped in the VMM, which interacts with the I/O devices.
  • The full device emulation approach is shown in Figure 3.14.
  • Issues – software emulation runs much slower than the hardware it emulates
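
The toy sketch below illustrates the full-device-emulation idea: guest reads and writes to device registers are trapped into the VMM and served by a software model of the device. The register layout and device behavior are invented for the example.

```python
# Toy full-device-emulation sketch: the VMM services trapped guest I/O in software.
# The register offsets and device behavior are invented for illustration.

class EmulatedSerialPort:
    """A software model of a device, kept inside the VMM."""
    DATA_REG, STATUS_REG = 0x0, 0x1

    def __init__(self):
        self.output = []

    def io_write(self, reg, value):          # invoked when a guest write is trapped
        if reg == self.DATA_REG:
            self.output.append(chr(value))

    def io_read(self, reg):                  # invoked when a guest read is trapped
        if reg == self.STATUS_REG:
            return 0x1                       # pretend the device is always ready
        return 0

if __name__ == "__main__":
    vmm_device = EmulatedSerialPort()
    for byte in b"hi":                       # each guest access traps into the VMM
        vmm_device.io_write(EmulatedSerialPort.DATA_REG, byte)
    print("software-emulated device received:", "".join(vmm_device.output))
```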

Para-virtualization

  • The para-virtualization method of I/O virtualization is typically used in Xen.
  • It is also known as the split driver model consisting of a frontend driver and a backend driver.
    • The frontend driver is running in Domain U
    • The backend driver is running in Domain 0.
  • They interact with each other via a block of shared memory.
  • The frontend driver manages the I/O requests of the guest OSes.
  • The backend driver is responsible for managing the real I/O devices and multiplexing the I/O data of different VMs.
  • Although para-I/O-virtualization achieves better device performance than full device emulation, it comes with a higher CPU overhead.
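
The following toy sketch mirrors the split-driver structure: a frontend queues guest I/O requests and a backend drains and multiplexes them onto the real device. It uses an ordinary in-process queue rather than Xen's actual shared-memory rings and grant tables.

```python
# Toy split-driver sketch: frontend (Domain U) and backend (Domain 0) share a queue.
# A real Xen implementation uses shared-memory rings and grant tables instead.
from collections import deque

shared_ring = deque()                         # stand-in for the shared memory block

class FrontendDriver:
    """Runs in a guest (Domain U); forwards the guest's I/O requests."""
    def __init__(self, vm_id):
        self.vm_id = vm_id

    def submit(self, request):
        shared_ring.append((self.vm_id, request))

class BackendDriver:
    """Runs in Domain 0; owns the real device and multiplexes VM requests."""
    def drain(self):
        while shared_ring:
            vm_id, request = shared_ring.popleft()
            print(f"Domain 0 issues '{request}' to the real device on behalf of VM{vm_id}")

if __name__ == "__main__":
    FrontendDriver(1).submit("read block 42")
    FrontendDriver(2).submit("write block 7")
    BackendDriver().drain()
```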

Direct I/O virtualization

  • Direct I/O virtualization lets the VM access devices directly.
  • It can achieve close-to-native performance without high CPU costs.
  • However, current direct I/O virtualization implementations focus on networking for mainframes; many challenges remain for commodity hardware devices.
    • For example, when a physical device is reclaimed for later reassignment (as required by workload migration), it may have been left in an arbitrary state (e.g., DMA into arbitrary memory locations) that can cause it to function incorrectly or even crash the whole system.
  • Since software-based I/O virtualization requires a very high overhead of device emulation, hardware-assisted I/O virtualization is critical.

Hardware-Assisted I/O Virtualization Examples

  • Intel VT-d supports the remapping of I/O DMA transfers and device-generated interrupts.
    • The architecture of VT-d provides the flexibility to support multiple usage models that may run unmodified, special-purpose, or “virtualization-aware” guest OSes.
  • Another way to help I/O virtualization is via self-virtualized I/O (SV-IO).
  • The key idea of SV-IO is to harness the rich resources of a multicore processor.
  • All tasks associated with virtualizing an I/O device are encapsulated in SV-IO.
  • It provides
    • virtual devices and an associated access API to VMs, and
    • a management API to the VMM.
  • SV-IO defines one virtual interface (VIF) for every kind of virtualized I/O device, such as virtual network interfaces, virtual block devices (disk), virtual camera devices, and others.
  • The guest OS interacts with the VIFs via VIF device drivers.
  • Each VIF consists of two message queues.
  • One is for outgoing messages to the devices and the other is for incoming messages from the devices.
  • In addition, each VIF has a unique ID for identifying it in SV-IO.
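
A minimal sketch of the VIF abstraction just described: each virtual interface carries a unique ID plus one outgoing and one incoming message queue. The class and method names are assumptions for illustration, not the SV-IO API.

```python
# Minimal sketch of an SV-IO-style virtual interface (VIF); names are assumptions.
from collections import deque

class VIF:
    """One virtual interface: a unique ID plus two message queues."""
    def __init__(self, vif_id, kind):
        self.vif_id = vif_id            # unique ID identifying the VIF inside SV-IO
        self.kind = kind                # e.g. "network", "block", "camera"
        self.outgoing = deque()         # messages from the guest to the device
        self.incoming = deque()         # messages from the device to the guest

    def send(self, msg):                # used by the guest-side VIF device driver
        self.outgoing.append(msg)

    def deliver(self, msg):             # used by SV-IO when the device has data
        self.incoming.append(msg)

if __name__ == "__main__":
    nic = VIF(vif_id=0, kind="network")
    nic.send("tx: packet 1")
    nic.deliver("rx: packet 9")
    print(nic.vif_id, nic.kind, list(nic.outgoing), list(nic.incoming))
```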

3.3.5 Virtualization in Multi-Core Processors

  • Virtualizing a multi-core processor is relatively more complicated than virtualizing a uni-core processor.
  • Though multicore processors are claimed to have higher performance by integrating multiple processor cores on a single chip, multi-core virtualization has raised some new challenges for computer architects, compiler constructors, system designers, and application programmers.
  • There are mainly two difficulties:
    • Application programs must be parallelized to use all cores fully
      • Solution – new programming models, languages, and libraries are needed to make parallel programming easier
    • Software must explicitly assign tasks to the cores, which is a very complex problem (see the core-affinity sketch after this list).
      • Solution – ongoing research into scheduling algorithms and resource management policies.
  • Issues –
    • performance,
    • complexity,
    • dynamic heterogeneity of the hardware infrastructure (embedding fat CPU cores and thin GPU cores on the same chip further complicates multi-core or many-core resource management).
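
As a concrete instance of explicitly assigning work to cores (the core-affinity sketch referenced in the list above), the Linux-only example below pins the current process to a single core with os.sched_setaffinity. The core number is arbitrary; real schedulers and resource managers use far richer policies.

```python
# Linux-only sketch: explicitly binding the current process to one core.
# Core numbers are arbitrary; real resource managers use richer policies.
import os

def pin_to_core(core_id):
    """Restrict the calling process to a single CPU core."""
    os.sched_setaffinity(0, {core_id})   # 0 means "the calling process"

if __name__ == "__main__":
    print("cores available before:", sorted(os.sched_getaffinity(0)))
    pin_to_core(0)
    print("cores available after: ", sorted(os.sched_getaffinity(0)))
```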

3.3.5.1 Physical versus Virtual Processor Cores

  • Wells et al. proposed a multicore virtualization method that allows hardware designers to obtain an abstraction of the low-level details of the processor cores.
  • This technique alleviates the burden and inefficiency of managing hardware resources by software.
  • It is located under the ISA and remains unmodified by the operating system or VMM (hypervisor).
  • Figure 3.16 illustrates the technique of a software-visible VCPU moving from one core to another and temporarily suspending execution of a VCPU when there are no appropriate cores on which it can run.

3.3.5.2 Virtual Hierarchy

  • This approach targets many-core chip multiprocessors (CMPs).
  • Instead of supporting time-sharing jobs on one or a few cores, we can use the abundant cores in a space-sharing manner, where
    • single-threaded or multithreaded jobs are simultaneously assigned to separate groups of cores for long time intervals.
  • To optimize for space-shared workloads, virtual hierarchies are used to overlay a coherence and caching hierarchy onto a physical processor.
  • This virtual hierarchy can adapt to fit how the work is space shared for improved performance and performance isolation.
  • Today’s many-core CMPs use a physical hierarchy of two or more cache levels that statically determine the cache allocation and mapping.
  • A virtual hierarchy is a cache hierarchy that can adapt to fit the workload or mix of workloads.
  • The hierarchy’s first level
    • locates data blocks close to the cores needing them for faster access,
    • establishes a shared-cache domain, and
    • establishes a point of coherence for faster communication.
  • When a miss leaves a tile, it first attempts to locate the block (or sharers) within the first level.
  • The first level can also provide isolation between independent workloads.
  • A miss at the first level can then invoke an access to the second level.
  • The idea is illustrated in Figure 3.17(a).
  • Space sharing is applied to assign three workloads to three clusters of virtual cores: namely
    • VM0 and VM3 for database workload,
    • VM1 and VM2 for web server workload, and
    • VM4–VM7 for middleware workload.
  • The basic assumption is that each workload runs in its own VM.
  • However, space sharing applies equally within a single operating system.
  • Statically distributing the directory among tiles can do much better, provided operating systems or hypervisors carefully map virtual pages to physical frames.
  • Marty and Hill suggested a two-level virtual coherence and caching hierarchy that harmonizes with the assignment of tiles to the virtual clusters of VMs.
  • Figure 3.17(b) illustrates a logical view of such a virtual cluster hierarchy in two levels.

First level

  • Each VM operates in an isolated fashion at the first level.
  • This will minimize both miss access time and performance interference with other workloads or VMs.
  • Moreover, the shared resources of cache capacity, interconnect links, and miss handling are mostly isolated between VMs.

Second level

  • The second level maintains a globally shared memory. This
    • Facilitates dynamically repartitioning resources without costly cache flushes.
    • Minimizes changes to existing system software and allows virtualization features such as content-based page sharing.
    • Adapts to space-shared workloads like multiprogramming and server consolidation.
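
To make the two-level lookup concrete, the toy sketch below resolves a block first in the requesting workload's own first-level (cluster) cache and falls back to the globally shared second level on a miss. Cluster assignments and block addresses are invented, and coherence state tracking is omitted.

```python
# Toy two-level virtual hierarchy lookup (coherence state omitted for brevity).
# Cluster assignments and block addresses are invented for illustration.

first_level = {                      # per-VM-cluster caches (virtual hierarchy level one)
    "database":   {0x100: "row data"},
    "webserver":  {0x200: "html page"},
}
second_level = {0x100: "row data", 0x200: "html page", 0x300: "shared lib"}

def lookup(cluster, block):
    """Try the workload's own cluster first; fall back to the global second level."""
    if block in first_level[cluster]:
        return ("level-1 hit", first_level[cluster][block])
    if block in second_level:
        first_level[cluster][block] = second_level[block]   # fill the first level
        return ("level-2 hit", second_level[block])
    return ("memory access", None)

if __name__ == "__main__":
    print(lookup("database", 0x100))    # served within the VM's own cluster
    print(lookup("webserver", 0x300))   # isolated miss, served by the shared level
```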