Virtual machines and scalability

I have already pointed out many problems with the comparison of virtual machines that IBM posted recently. But there is one topic which I thought required a separate post, namely scalability.

What is scalability?

Simply put, scalability is the ability to take advantage of having more CPUs, more memory, more disk, or more bandwidth. If I put two CPUs on the task, do I get twice the power? Not in general. As Fred Brooks famously observed, no matter how many women you put on the task, it still takes nine months to make a baby. On the other hand, with nine women, you can make nine babies in nine months. In computer terminology, we would say that making babies is a task where throughput (the total amount of work per unit of time) scales well with the number of baby-carrying women, whereas latency (the time it takes to complete a single request) does not.



Computer scalability is very similar. Different applications care about throughput or latency in different ways. For example, if you connect to Google Maps, latency is the time it takes for Google to show the map, but in that case (unlike for pregnant women), it is presumably improved because Google Maps fetches many small chunks of the map in parallel.

I have already written in an earlier post why I believe HP has a good track record with respect to partitioning and scalability.

Scalability of virtual machines

However, IBM has very harsh words against HP Integrity Virtual Machines (aka HPVM), and describes HPVM scalability as a “downside” of the product:

The downside here is scalability. With HP’s virtual machines, there is a 4 CPU limitation and RAM limitation of 64GB. Reboots are also required to add processors or memory. There is no support for features such as uncapped partitions or shared processor pools. Finally, it’s important to note that HP PA RISC servers are not supported; only Integrity servers are supported. Virtual storage adapters also cannot be moved, unless the virtual machines are shut down. You also cannot dedicate processing resources to single partitions.

I already pointed out factual errors in every sentence of this paragraph. But scalability is a more subtle problem, and it takes more explaining just to describe what the problems are, not to mention possible solutions… What matters is not just the performance of a single virtual machine when nothing else is running on the system. You also care about performance under competition, about fairness and balance between workloads, about response time to changes in demand.

The problem is that these are all contradictory goals. You cannot increase the performance of one virtual machine without taking something away from the others. Obviously, the CPU time that you give to one VM cannot be given to another one at the same time. Similarly, increasing the reactivity to fast-changing workloads also increases the risk of instability, as for any feedback loop. Finally, in a server, there is generally no privileged workload, which makes the selection of the “correct” answers harder to make than for workstation virtualization products.

Checkmark features vs. usefulness

Delivering good VM performance is a complex problem. It is not just a matter of lining up virtual CPUs. HPVM implements various safeguards to help ensure that a VM configuration will not just run, but run well. I don’t have as much experience with IBM micro-partitions, but it seems much easier to create configurations that are inefficient by construction. What IBM calls a “downside” of HPVM is, I believe, a feature.

Here is a very simple example. On a system with, say, 4 physical CPUs, HPVM will warn you if you try to configure more than 4 virtual CPUs:

bash-2.05b# hpvmmodify -P vm7 -c 8
HPVM guest vm7 configuration problems:
    Warning 1: The number of guest VCPUS exceeds server's physical cpus.
    Warning 2: Insufficient cpu resource for guest.
These problems may prevent HPVM guest vm7 from starting.
hpvmmodify: The modification process is continuing.


It seems like a sensible thing to do. After all, if you only have 4 physical CPUs, you will not get more power by adding virtual CPUs. There is, however, a good chance that you will get less performance, in any case where one virtual CPU waits on another. Why? Because you have increased the chances that the virtual CPU you are waiting on is not actually running at the time you request its help, independently of the synchronization mechanism you are using. So instead of getting a response in a matter of microseconds (the typical wait time for, say, spinlocks), you will get it in a matter of milliseconds (the typical time slice on most modern systems).

Now, the virtual machine monitor might be able to do something smart about some of the synchronization mechanisms (notably kernel-level ones). But there are just too many ways to synchronize threads in user-space. In other words, by configuring more virtual CPUs than physical CPUs, you simply increased the chances of performing sub-optimally. How is that a good idea?

IBM doesn’t seem to agree with me. First, in their article, they complain about HP vPars implementing a similar restriction: “The scalability is also restricted to the nPartition that the vPar is created on.” Also, the IBM user interface lets you create micro-partitions that have too many virtual CPUs with nary a peep. You can create a micro-partition with 16 virtual CPUs on a 2-way host, as illustrated below. Actually, 16 virtual CPUs is close to the maximum on a 2-way host for another reason: IBM’s technology requires a minimum of 0.1 physical CPU per virtual CPU, and 16 × 0.1 is 1.6, which leaves only a meager 0.4 CPU for the virtual I/O server.

[Figure: IBM host configuration allowing too many virtual CPUs]

The problem is that no matter how I look at it, I can’t imagine how it would be a good thing to run 16 virtual CPUs on a 2-CPU system. To me, this approach sounds a lot like the Fantasia school of scalability. If you remember, in that movie, Mickey Mouse plays a sorcerer’s apprentice who casts a spell so that his broom will do his chores for him. But things rapidly get wildly out of control. When Mickey tries to ax the broom to stop the whole thing, each fragment rapidly grows back into a full-grown broom, and things go from bad to worse. CPUs, unfortunately, are not magic brooms: cutting a CPU in half will not magically make two full-size CPUs.

Performing well in degraded configurations

Now, I don’t want you to believe that I went all defensive because IBM found a clever way to do something that HPVM can’t do. Actually, even if HPVM warns you by default, you can still force it to start a guest in such a “stupid” configuration, using the -F switch of hpvmstart. And it’s not like HPVM systematically performs really badly in this case either.

For example, below are the build times for a Linux 2.6.26.5 kernel in a variety of configurations.

4-way guest running on a 4-way host, 5 jobs

[linux-2.6.26.5]# time gmake -j5
real    5m25.544s
user    18m46.979s
sys     1m41.009s

8-way guest running on a 4-way host, 9 jobs

[linux-2.6.26.5]# time gmake -j9
real    5m38.680s
user    36m23.662s
sys     3m52.764s

8-way guest running on a 4-way host, 5 jobs

[linux-2.6.26.5]# time gmake -j5
real    5m35.500s
user    22m25.501s
sys     2m6.003s

As you can verify, the build time is almost exactly the same whether the guest has 4 or 8 virtual CPUs. As expected, the additional virtual CPUs do not bring any benefit. In this case, the degradation exists, but it is minor. It is, however, relatively easy to construct cases where the degradation would be much larger. Another observation is that running only enough jobs to keep 4 virtual CPUs busy actually improves performance: less time is spent by the virtual CPUs waiting on one another.

So, why do we even test such configurations or allow them to run, then? Well, there is always the possibility that a CPU goes bad, in which case the host operating system is most likely to disable it. When that happens, we may end up running with an insufficient number of CPUs. Even so, this is no reason to kill the guest. We still want to perform as well as we can, until the failed CPU is replaced with a good one.

In short, I think that HPVM is doing the right thing by telling you if you are about to do something that will not be efficient. However, in case you found yourself in that situation due to some unplanned event, such as a hardware failure, it still does the hard work to keep you up and running with the best possible performance.

Remaining balanced and fair

There is another important point to consider regarding the performance of virtual machines. You don’t want virtual machines to just perform well, you also care a lot about maintaining balance between the various workloads, both inside the virtual machine itself, and between virtual machines. This is actually very relevant to scalability, because multi-threaded or multi-processor software often scales worse when some CPUs run markedly slower than others.

Consider for example that you have 4 CPUs, and divide a task into four approximately equal-sized chunks. The task will only complete when all 4 sub-tasks are done. If one CPU is significantly slower, all other CPUs will have to wait for it. In some cases, such as ray-tracing, it may be easy enough for another CPU to pick up some of the extra work. For other more complicated algorithms, however, the cost of partitioning may be significant, and it may not pay off to re-partition the task in flight. And even when re-partitioning on the fly is possible, software is unlikely to have implemented it if it did not bring any benefit on non-virtual hardware.

Loading virtual machines little by little…

In order to get a better feeling for all these points, readers are invited to do the following very simple experiment with their favorite virtual machine technology. To maximize chances that you can run the experiment yourself, I will not base my experiment on some 128-way machine with 1TB of memory running 200 16-way virtual machines or anything über-expensive like that. Instead, I will consider the simplest of all cases involving SMP guests: two virtual machines VM1 and VM2, each with two virtual CPUs, running concurrently on a 2-CPU machine. What could possibly go wrong with that? Nowadays, this is something you can try on most laptops…

The experiment is equally simple. We will incrementally load the virtual machines with more and more jobs, and see what happens. When I ran the experiment, I used a simple CPU spinner program written in C that counts how many loops per second it can perform. The baseline, which I will refer to as “100%”, is the number of iterations that the program makes in one virtual machine while the other virtual machine sits idle. This is illustrated below, with Process 1 running in VM1, colored in orange.

CPU 1 CPU 2
Process 1 Idle

Now, let’s say that you start another identical process in VM2. The ideal situation is that one virtual CPU for each virtual machine gets loaded at 100%, so that each process gets a 100% score. In other words, each physical CPU is dedicated to running a virtual CPU, but the virtual CPUs belong to different virtual machines. The sum of the scores is 200%, which is the maximum you can get on the machine, and the average is 100%. This is both optimal and fair. As far as I can tell, both HPVM and IBM micro-partitions implement this behavior. This is illustrated below, with VM1 in orange and VM2 in green.

CPU 1 CPU 2
Process 1 Process 2

However, this behavior is not the only choice. Versions of VMware up to version 3 used a mechanism called co-scheduling, where all virtual CPUs must run together. As the document linked above shows, VMware was boasting about that technique, but the result was that as soon as one virtual CPU was busy, the other physical CPU had to be reserved as well. As a result, in our little experiment, each process would get 50% of its baseline, not 100%. This approach is fair, but hardly optimal, since you waste half of the available CPU power. VMware apparently chose that approach to avoid dealing with the more complicated cases where one virtual CPU waits for another virtual CPU that is not running at the time.

CPU 1 CPU 2
Process 1 Idle
Idle Process 2

Now, let’s fire up a second process in VM1. This is where things get interesting. In that situation, VM1 has both its virtual CPUs busy, but VM2 has only one virtual CPU busy, the other being idle. There are many choices here. One is to schedule the two CPUs of VM1, then the two CPUs of VM2 (even if one is idle). This method is fair between the two virtual machines, but it reserves a physical CPU for an idle virtual CPU half of the time. As a result, all processes will get a score of 50%. This is fair, but suboptimal, since you get a total score of 150% when you could get 200%.

CPU 1 CPU 2
Process 1 Process 3
Process 2 Idle

In order to optimize things, you have to take advantage of that ‘idle’ spot, but that creates imbalance. For example, you may want to allocate CPU resources as follows:

CPU 1 CPU 2
Process 1 Process 2
Process 1 Process 3

This scenario is optimal, since the total CPU bandwidth available is 200%, but it is not fair: process 1 now gets twice as much CPU bandwidth as processes 2 and 3. In the worst case, the guest operating system may end up being confused by what is going on. So one solution is to balance things out over longer periods of time:

CPU 1 CPU 2
Process 1 Process 2
Process 1 Process 3
Process 3 Process 2

This solution is again optimal and fair: processes 1, 2 and 3 each get 66% of a CPU, for a total of 200%. But other important performance considerations come into play. One is processor affinity: keeping a process bound to a given CPU improves cache and TLB usage. But in this example, at least one of the processes will have to jump from one CPU to the other, even if the guest operating system thinks that it is bound to a single CPU.

Another big downside, as far as scalability is concerned, relates to inter-process communication. If processes 1 and 3 want to talk to one another in VM1, they can only do so without waiting half of the time, since during the other half, the other physical CPU is running a process that belongs to another virtual machine. As a consequence, the latency of this inter-process communication increases very significantly. As far as I can tell, this particular problem is the primary issue with the scalability of virtual machines. VMware tried to address it with co-scheduling, but we have seen why that is not a perfect solution. Statistically speaking, adding virtual CPUs increases the chances that the CPU you need is not running, in particular when other virtual machines are competing for CPU time.

Actual scalability vs. simple metrics

This class of problems is the primary reason why HPVM limits the size of virtual machines. It is not that larger configurations don’t work. There are even workloads that scale really well, Linux kernel builds or ray-tracing being good examples. But even workloads that scale well in a single virtual machine will no longer scale as well under competition. Again, virtual machine scalability is nowhere near as simple as “add more virtual CPUs and it works”.

This is not just theory. We have tested it. Consider the graph below, which shows the results of the same benchmark run in a variety of configurations. The top blue line, which is almost straight, is perfect scalability, which you practically get on native hardware. The red line is HPVM under light competition, i.e. with other virtual machines running but mostly idle. In that case, HPVM scales well up to 16-way. The blue line below is HPVM under heavy competition. If memory serves me right, the purple line is fully-virtualized Xen… under no competition.

In other words, if HPVM supports 8 virtual CPUs today, it is because we believe that we can maintain useful scalability on demanding workloads, even under competition. We know, for having tested and measured it, that going beyond 8-way will usually not result in increased performance, only in increased CPU churn.

One picture is worth 2^10 words

As we have shown, making the right decisions for virtual machines is not simple. Interestingly, even the very simple experiment I just described highlights important differences between various virtual machine technologies. After launching 10 processes on each guest, here is the performance of the various processes when running under HPVM. In this case, the guest is a Red Hat Enterprise Linux 4 server competing against an HP-UX partition running the same kind of workload. You can see the Linux scheduler granting time to all processes almost exactly fairly, although with some noise. I suspect that this noise is the result of the feedback loop that Linux uses to ensure fairness between processes.

By contrast, here is how AIX 6.1 performs when running the same workload. As far as I can tell, IBM implements what looks like a much simpler algorithm, probably without any feedback loop. (It is possible that there is an option to enable fair-share scheduling on AIX; I am much less familiar with that system, obviously.) The clear benefit is that it is very stable over time. The downside is that it seems quite unfair compared to what we see in Linux: some processes get a lot more CPU time than others, and this over an extended period of time (the graph spans about 5 minutes).

The result shown in the graphs is actually a combination of the effect of the operating system and virtual machine schedulers. In the case of IBM servers, I must say that I’m not entirely sure about how the partition and process schedulers interact. I’m not even sure that they are distinct: partitions seem to behave much like processes in AIX. In the case of HPVM, you can see the effect of the host HP-UX scheduler on the total amount allocated to the virtual machine.

Conclusion

Naturally, this is a very simple test, not even a realistic benchmark. It is not intended to demonstrate the superiority of one approach over the other. Instead, it demonstrates that virtual machine scalability and performance is not as simple as counting how many CPUs your virtual machine software can allocate. There are large numbers of complicated trade-offs, and what works for one class of workloads might not work so well for others.

I would not be surprised if someone shows benchmarks where IBM scales better than HPVM. Actually, it should be expected: after all, micro-partitions are almost built into the processor to start with; the operating system has been rewritten to delegate a number of tasks to the firmware; AIX really runs paravirtualized. HPVM, on the other hand, is full virtualization, which implies higher virtualization costs. It doesn’t get any help from Linux or Windows, and only very limited help from HP-UX. So if anything, I expect IBM micro-partitions to scale better than HPVM.

Yet I must say that my experience has not confirmed that expectation. In the few tests I made, differences, if any, were in HPVM’s favor. Therefore, my recommendation is to benchmark your application and see which solution performs best for you. Don’t let others tell you what the result should be, not even me…
