Sorry for my non-French readers, here is an article from LeMag IT about HPVM, who several colleagues forwarded me.
This is also probably the first time someone links to the Taodyne web site.
Welcome to the readers of LeMag IT who decided to follow the link to my blog…
Two days ago, I attended a conference in Paris on the future of virtualization in mission-critical environments. There was a presentation from Intel about the roadmap for Itanium and virtualization.
Stability vs. Innovation
Two things in this presentation reminded me of what Martin Fink calls the Unix paradox:
- Intel pointed out that Itanium is mission-critical, so they tend to be more conservative. For example, they use processes for Itanium that have already been proven on the x86 side. Like for Unix, there is a similar paradox for mission-critical processors.
- The cost equation is very different for mission-critical systems. For commodity hardware, the acquisition cost tends to dominate your thinking; for mission-critical hardware, it’s the (potential) cost of losing the system that drives you. So whereas in volume systems, you are ready to pay more for innovation, e.g. features or performance, in mission-critical systems, it’s for stability that you pay more.
Following the recent discussion with Eric Raymond, I thought that was yet another interesting angle about what “innovation” means to different people.
One step at a time, or “TIC-TAC-TOC”
Intel also reminded us of the TIC-TOC model they now use to release CPUs:
- TIC: change the process on a stable micro-architecture
- TOC: change the micro-architecture on a stable process
I think that a similar approach applies to how our customers want to upgrade their mission-critical software, something that I would call TIC-TAC-TOC:
- TIC: Change the infrastructure (e.g. machines, disks), keep OS and applications the same
- TAC: Change the applications, keep infrastructure and OS the same
- TOC: Change the OS, keep infrastructure and applications the same
Customers may, at their discretion, decide to do multiple steps at the same time. For example, they may use an infrastructure change as an opportunity to also upgrade their OS and applications. But as a vendor, we should be careful not to force them to de-stabilize more than one thing at once. It should be their choice, not ours.
When innovation is the problem
Historically, HP has been good at this. TIC: You update from PA-RISC to Itanium, and you can still run the same OS, still run your PA-RISC applications. TAC: Upgrade the applications, keeping everything else the same, and you get a healthy speed boost. TOC: Upgrade to 11iv3, another speed boost; Install HP Integrity Virtual Machines and you get the latest in virtualization features, even on 2002-vintage Itanium hardware. As far as I know, you can’t virtualize a POWER4, and you can’t get Live Partition Mobility on a POWER5 system.
But TIC-TAC-TOC is not a perfect solution. That model is painless for customers only if we can convince them to stay reasonably current in two out of three dimensions at any given time. That model breaks down for a customer who runs HP-UX 10.20 on PA-RISC and obsolete applications. Such customers feel left behind, and the leap of faith to move to current technology is so big that they are an easy prey for competitors.
So here is my interpretation of Martin Fink’s Unix paradox:
Stability + Innovation = Disruption
How to solve that equation is left as an exercise for the reader
What is scalability?
Simply put, scalability is the ability to take advantage of having more CPUs, more memory, more disk, more bandwidth. If I put two CPUs to the task, do I get twice the power? It is not generally true. As Fred Brooks said, no matter how many women you put to the task, it still takes nine months to make a baby. On the other hand, with nine women, you can make nine babies in nine months. In computer terminology, we would say that making babies is a task where throughput (the total amount of work per time unit) scales well with the number of baby-carrying women, whereas latency (the time it takes to complete a request) does not.
Computer scalability is very similar. Different applications will care about bandwidth or latency in different ways. For example, if you connect to Google Maps, latency is the time it takes for Google to show the map, but in that case (unlike for pregnant women), it is presumably improved because Google Maps sends many small chunks of the map in parallel.
I have already written in an earlier post why I believe HP has a good track record with respect to partitioning and scalability.
Scalability of virtual machines
However, IBM has very harsh words against HP Integrity Virtual Machines (aka HPVM), and describes HPVM scalability as a “downside” of the product:
The downside here is scalability. With HP’s virtual machines, there is a 4 CPU limitation and RAM limitation of 64GB. Reboots are also required to add processors or memory. There is no support for features such as uncapped partitions or shared processor pools. Finally, it’s important to note that HP PA RISC servers are not supported; only Integrity servers are supported. Virtual storage adapters also cannot be moved, unless the virtual machines are shut down. You also cannot dedicate processing resources to single partitions.
I already pointed out factual errors in every sentence of this paragraph. But scalability is a more subtle problem, and it takes more explaining just to describe what the problems are, not to mention possible solutions… What matters is not just the performance of a single virtual machine when nothing else is running on the system. You also care about performance under competition, about fairness and balance between workloads, about response time to changes in demand.
The problem is that these are all contradictory goals. You cannot increase the performance of one virtual machine without taking something away from the others. Obviously, the CPU time that you give to one VM cannot be given to another one at the same time. Similarly, increasing the reactivity to fast-changing workloads also increases the risk of instability, as for any feedback loop. Finally, in a server, there is generally no privileged workload, which makes the selection of the “correct” answers harder to make than for workstation virtualization products.
Checkmark features vs. usefulness
Delivering good VM performance is a complex problem. It is not just a matter of lining up virtual CPUs. HPVM implements various safeguards to help ensure that a VM configuration will not just run, but run well. I don’t have as much experience with IBM micro-partitions, but it seems much easier to create configurations that are inefficient by construction. What IBM calls a “downside” of HPVM is, I believe, a feature.
Here is a very simple example. On a system with, say, 4 physical CPUs, HPVM will warn you if you try to configure more than 4 virtual CPUs:
bash-2.05b# hpvmmodify -P vm7 -c 8 HPVM guest vm7 configuration problems: Warning 1: The number of guest VCPUS exceeds server's physical cpus. Warning 2: Insufficient cpu resource for guest. These problems may prevent HPVM guest vm7 from starting. hpvmmodify: The modification process is continuing.
It seems like a sensible thing to do. After all, if you only have 4 physical CPUs, you will not get more power by adding more virtual CPUs. There are, however, good chances that you will get less performance, in any case where one virtual CPU waits on another. Why? Because you increased the chances that the virtual CPU you are waiting on is not actually running at the time you request its help, independently of the synchronization mechanism that you are using. So instead of getting a response in a matter of microseconds (the typical wait time for, say, spinlocks), you will get it in a matter of milliseconds (the typical time slice on most modern systems).
Now, the virtual machine monitor might be able to do something smart about some of the synchronization mechanisms (notably kernel-level ones). But there are just too many ways to synchronize threads in user-space. In other words, by configuring more virtual CPUs than physical CPUs, you simply increased the chances of performing sub-optimally. How is that a good idea?
IBM doesn’t seem to agree with me. First, in their article, they complain about HP vPars implementing a similar restriction: The scalability is also restricted to the nPartition that the vPar is created on. Also, the IBM user-interface lets you create micro-partitions that have too many virtual CPUs with nary a peep. You can create a micro-partition with 16 virtual CPUs on a 2-way host, as illustrated below. Actually, 16 virtual CPUs is almost the maximum on a two way host for another reason: there is a minimum of 0.1 physical CPU per virtual CPU in the IBM technology, and 16 * 0.1 is 1.6, which only leaves a meager 0.4 CPU for the virtual I/O server.
The problem is that no matter how I look at it, I can’t imagine how it would be a good thing to run 16 virtual CPUs on a 2-CPU system. To me, this approach sounds a lot like the Fantasia school of scalability. If you remember, in that movie, Mickey Mouse plays a sorcerer apprentice who casts a spell so that his broom will do his chores in his stead. But things rapidly get wildly out of control. When Mickey tries to ax the brooms to stop the whole thing, each fragment rapidly grows back into a full grown broom, and things go from bad to worse. CPUs, unfortunately, are not magic brooms: cutting a CPU in half will not magically make two full-size CPUs.
Performing well in degraded configurations
Now, I don’t want you to believe that I went all defensive because IBM found a clever way to do something that HPVM can’t do. Actually, even if HPVM warns you by default, you can still force it to start a guest in such a “stupid” configuration, using the -F switch of hpvmstart. And it’s not like HPVM systematically performs really badly in this case either.
For example, below are the build times for a Linux 2.6.25 kernel in a variety of configurations.
4-way guest running on a 4-way host, 5 jobs[linux-184.108.40.206]# gmake -j5 real 5m25.544s user 18m46.979s sys 1m41.009s
8-way guest running on a 4-way host, 9 jobs[linux-220.127.116.11]# time gmake -j9 real 5m38.680s user 36m23.662s sys 3m52.764s
8-way guest running on a 4-way host, 5 jobs[linux-18.104.22.168]# time gmake -j5 real 5m35.500s user 22m25.501s sys 2m6.003s
As you can verify, the build time is almost exactly the same, whether the guest has 4 our 8 virtual CPUs. As expected, the additional virtual CPUs do not bring any benefit. In that case, the degradation exists, but it is minor. It is however relatively easy to build cases where the degradation would be much larger. Another observation is that running only enough jobs to keep 4 virtual CPUs busy actually improves performance: less time is spent for the virtual CPUs to wait on one another.
So, why do we even test such configurations or allow them to run, then? Well, there is always the possibility that a CPU goes bad, in which case the host operating system is most likely to disable it. When that happens, we may end up running with an insufficient number of CPUs. Even so, this is no reason to kill the guest. We still want to perform as well as we can, until the failed CPU is replaced with a good one.
In short, I think that HPVM is doing the right thing by telling you if you are about to do something that will not be efficient. However, in case you found yourself in that situation due to some unplanned event, such as a hardware failure, it still does the hard work to keep you up and running with the best possible performance.
Remaining balanced and fair
There is another important point to consider regarding the performance of virtual machines. You don’t want virtual machines to just perform well, you also care a lot about maintaining balance between the various workloads, both inside the virtual machine itself, and between virtual machines. This is actually very relevant to scalability, because multi-threaded or multi-processor software often scales worse when some CPUs run markedly slower than others.
Consider for example that you have 4 CPUs, and divide a task into four approximately equal-sized chunks. The task will only complete when all 4 sub-tasks are done. If one CPU is significantly slower, all other CPUs will have to wait for it. In some cases, such as ray-tracing, it may be easy enough for another CPU to pick up some of the extra work. For other more complicated algorithms, however, the cost of partitioning may be significant, and it may not pay off to re-partition the task in flight. And even when re-partitioning on the fly is possible, software is unlikely to have implemented it if it did not bring any benefit on non-virtual hardware.
Loading virtual machines little by little…
In order to get a better feeling for all these points, readers are invited to do the following very simple experiment with their favorite virtual machine technology. To maximize chances that you can run the experiment yourself, I will not base my experiment on some 128-way machine with 1TB of memory running 200 16-way virtual machines or anything über-expensive like that. Instead, I will consider the simplest of all cases involving SMP guests: two virtual machines VM1 and VM2, each with two virtual CPUs, running concurrently on a 2-CPU machine. What could possibly go wrong with that? Nowadays, this is something you can try on most laptops…
The experiment is equally simple. We will simply incrementally load the virtual machines with more and more jobs, and see what happens. When I ran the experiment, I used a simple CPU spinner program written in C that counts how many loops per second it can perform. The baseline, which I will refer to as “100%”, is the number of iterations that the program makes on a virtual machine, with another virtual machine sitting idle. This is illustrated below, with the process Process 1 running in VM1, colored in orange.
|CPU 1||CPU 2|
Now, let’s say that you start another identical process in VM2. The ideal situation is that one virtual CPU for each virtual machine gets loaded at 100%, so that each process gets a 100% score. In other words, each physical CPU is dedicated to running a virtual CPU, but the virtual CPUs belong to different virtual machines. The sum of the scores is 200%, which is the maximum you can get on the machine, and the average is 100%. This is both optimal and fair. As far as I can tell, both HPVM and IBM micro-partitions implement this behavior. This is illustrated below, with VM1 in orange and VM2 in green.
|CPU 1||CPU 2|
|Process 1||Process 2|
However, this behavior is not the only choice. Versions of VMware up to version 3 used about a mechanism called co-scheduling, where all virtual CPUs must run together. As the document linked above shows, VMware was boasting about that technique, but the result was that as soon as one virtual CPU was busy, the other physical CPU had to be reserved as well. As a result, in our little experiment, each process would get 50% of its baseline, not 100%. This approach is fair, but hardly optimal since you waste half of the available CPU power. VMware apparently chose that approach to avoid dealing with the more complicated cases where one virtual CPU would wait for another virtual CPU that was not running at the time.
|CPU 1||CPU 2|
Now, let’s fire a second process in VM1. This is where things get interesting. In that situation, VM1 has both its virtual CPUs busy, but VM2 has only one virtual CPU busy, the other being idle. There are many choices here. One is to schedule the two CPUs of VM1, then the two CPUs of VM2 (even if one is idle). This method is fair between the two virtual machines, but it reserves a physical CPU for an idle virtual CPU half of the time. As a result, all processes will get a score of 50%. This is fair, but suboptimal, since you get a total score of 150% when you could get 200%.
|CPU 1||CPU 2|
|Process 1||Process 3|
In order to optimize things, you have to take advantage of that ‘idle’ spot, but that creates imbalance. For example, you may want to allocate CPU resources as follows:
|CPU 1||CPU 2|
|Process 1||Process 2|
|Process 1||Process 3|
This scenario is optimal, since the total CPU bandwidth available is 200%, but it is not fair: process 1 now gets twice as much CPU bandwidth as processes 2 and 3. In the worst case, the guest operating system may end up being confused by what is going on. So one solution is to balance things out over longer periods of time:
|CPU 1||CPU 2|
|Process 1||Process 2|
|Process 1||Process 3|
|Process 3||Process 2|
This solution is again optimal and fair: process 1, 2 and 3 each get 66% of a CPU, for a total of 200%. But other important performance considerations come into play. One is that we cannot keep all processes on a single CPU. Keeping processes bound to a given CPU improves cache and TLB usage. But in this example, one of the processes (at least) will have to jump from one CPU to the other, even if the guest operating system thinks that it’s bound to a single CPU.
Another big downside as far as scalability is concerned is with respect to inter-process communication. If processes 1 and 3 want to talk to one another in VM1, they can do so without waiting only half of the time, since during the other half, the other CPU is actually running a process that belongs to another virtual machine. A consequence is that the latency of this inter-process communication increase very significantly. As far as I can tell, this particular problem is the primary issue with the scalability of virtual machines. VMware tried to address it with co-scheduling, but we have seen why it is not a perfect solution. Statistically speaking, adding virtual CPUs increases the chances that the CPU you need will not be running, in particular when other virtual machines are competing for CPU time.
Actual scalability vs. simple metrics
This class of problems is the primary reason why HPVM limits the size of virtual machines. It is not that it doesn’t work. There are even workloads that scale really well, Linux kernel builds or ray-tracing being good examples. But even workloads that scale OK with a single virtual machine will no longer scale as well under competition. Again, virtual machine scalability is nowhere as simple as “add more virtual CPUs and it works”.
This is not just theory. We have tested the theory. Consider the graph below, which shows the results of the same benchmark run into a variety of configurations. The top blue line, which is almost straight, is perfect scalability, which you practically get on native hardware. The red line is HPVM under light competition, i.e. with other virtual machines running but mostly idle. In that case, HPVM scales well up to 16-way. The blue line below is HPVM under heavy competition. If memory serves me right, the purple line is fully-virtualized Xen… under no competition.
In other words, if HPVM supports 8 virtual CPUs today, it is because we believe that we can maintain useful scalability on demanding workloads and even under competition. We know, for having tested and measured it, that going beyond 8-way will usually not result in increased performance, only in increasing CPU churn.
One picture is worth 210 words
As we have shown, making the right decisions for virtual machines is not simple. Interestingly, even the very simple experiment I just described highlights important differences between various virtual machine technologies. After launching 10 processes on each guest, here is the performance of the various processes when running under HPVM. In that case, the guest is a Linux RedHat 4 server competing against an HP-UX partition running the same kind of workload. You can see the Linux scheduler granting time to all processes almost exactly fairly, although there is some noise. I suspect that this noise is the result of the feedback loop that Linux puts in place to ensure fairness between processes.
By contrast, here is how AIX 6.1 performs when running the same workload. As far as I can tell, IBM implements what looks like a much simpler algorithm, probably without any feedback loop. It’s possible that there is an option to enable fair share scheduling on AIX (I am much less familiar with that system, obviously). The clear benefit is that it is very stable over time. The downside is that it seems quite a bit unfair compared to what we see in Linux. Some processes are getting a lot more CPU time than others, and this over an extended period of time (the graph spans about 5 minutes).
The result shown in the graphs is actually a combination of the effect of the operating system and virtual machine schedulers. In the case of IBM servers, I must say that I’m not entirely sure about how the partition and process schedulers interact. I’m not even sure that they are distinct: partitions seem to behave much like processes in AIX. In the case of HPVM, you can see the effect of the host HP-UX scheduler on the total amount allocated to the virtual machine.
Naturally, this is a very simple test, not even a realistic benchmark. It is not intended to demonstrate the superiority of one approach over the other. Instead, it demonstrates that virtual machine scalability and performance is not as simple as counting how many CPUs your virtual machine software can allocate. There are large numbers of complicated trade-offs, and what works for one class of workloads might not work so well for others.
I would not be surprised if someone shows benchmarks where IBM scales better than HPVM. Actually, it should be expected: after all, micro-partitions are almost built into the processor to start with; the operating system has been rewritten to delegate a number of tasks to the firmware; AIX really runs paravirtualized. HPVM, on the other hand, is full virtualization, which implies higher virtualization costs. It doesn’t get any help from Linux or Windows, and only very limited help from HP-UX. So if anything, I expect IBM micro-partitions to scale better than HPVM.
Yet I must say that my experience has not confirmed that expectation. In the few tests I made, differences, if any, were in HPVM’s favor. Therefore, my recommendation is to benchmark your application and see which solution performs best for you. Don’t let others tell you what the result should be, not even me…
IBM just posted a comparison of virtual machines that I find annoyingly flawed. Fortunately, discussion on OSnews showed that technical people don’t buy this kind of “data”. Still, I thought that there was some value in pointing out some of the problems with IBM’s paper.
nPartitions are tougher than logical partitions
What HP calls ‘nPartitions’ are electrically isolated partitions. IBM correctly indicates that nPartitions offers true electrical isolation. The key benefit, using IBM’s own words, is that nPartitions allow you to service one partition while others are online. You can for example extract a cell, replace or upgrade CPUs or memory, and re-insert the cell without downtime.
The problem is when IBM states that this is similar to IBM logical partitioning. In reality, electrically-isolated partitions are tougher. Like on IBM systems, CPUs can be replaced in case of failure. But you can replace entire cells without shutting down the system, contrary to IBM’s claim that systems require a reboot when moving cells from one partition to another.
Finally, IBM remarks that Another downside is that entry-level servers do not support this technology, only HP9000 and Integrity High End and Midrange servers. Highly redundant, electrically isolated partitions have a cost, so they don’t make so much sense on entry level systems. But it’s not a downside of HP that they only have partitioning technologies similar to IBM’s on entry level systems, it’s a downside of IBM not to have anything similar to nPartitions on their high-end servers.
Integrity servers support more operating systems
IBM writes: It’s important to note that while nPartitions support HP-UX, Windows®, VMS, and Linux, they only do so on the Itanium processor, not on the HP9000 PA Risc architecture. It is true that HP (Microsoft, really) doesn’t support Windows on PA-RISC, a long obsolete architecture, but what is more relevant is that IBM doesn’t support Windows on POWER even today. This is the same kind of spin as for electrically-isolated partitions: IBM tries to divert attention from a missing feature in their product line by pointing out that there was a time when HP didn’t have the feature either.
IBM writes that Partition scalability also depends on the operating system running in the nPartition,. Again, this is true… on any platform. But the obvious innuendo here is that the scalability of Integrity servers is inconsistent across operating systems. A quick visit to the TPC-H and and TPC-C result pages will demonstrate that this is false. HP posts world-class results with HP-UX and Windows, and other Itanium vendors demonstrate the scalability of Linux on Itanium.
By contrast, the Power systems from IBM or Bull all run AIX. And if you want to talk about scalability, IBM as of today has no 30TB TPC-H results, and it often takes several IBM machines to compete with a single HP system.
Regarding scalability of other operating systems, HP engineers have put a lot of effort early into Linux scalability. I would go as far as saying that 64-way scalability of Linux is probably something that HP engineers tested seriously before anybody else, and they got some recognition for this work.
HP offers both isolation and flexibility
IBM also tries to minimize the flexibility of hard partitions. If this is electrically isolated, it can’t be flexible, right? That’s exactly what IBM wants you to believe: They also do not support moving resources to and from other partitions without a reboot.
However, HP has found a clever way to provide both flexibility and electrical isolation. With HP’s Instant Capacity (iCap), you can move “right to use” CPUs around. On such a high-end machine, it is a good idea to have a few spare CPUs, either in case of CPU failure or in case of a sudden peak in demand. HP recognized that, so you can buy machines with CPUs installed, but disabled. You pay less for these CPUs, you do not pay software licenses, and so on.
The neat trick that iCap enables is to transfer a CPU from one nPartition to another without breaking electrical isolation: you simply shut it down in one partition, which gives you the right to use it in another partition without paying anything for it. All this can be done without any downtime, and again, it preserved electrical isolation between the partitions.
Shared vs. dedicated
Regarding virtual partitions, IBM is quick to point out the “drawbacks”: What you can’t do with vPars is share resources, because there is no virtualized layer in which to manage the interface between the hardware and the operating systems. Of course, you cannot do that with dedicated resources on IBM systems either, that’s the point of having a partitioning system with dedicated resources! If you want shared resources, HP offers virtual machines, see below. So when IBM claims that This is one reason why performance overhead is limited, a feature that HP will market without discussing its clear limitations, the same also holds for dedicated resources on IBM systems.
What is true is that IBM has attempted to deliver in a single product what HP offers in two different products, virtual partitions (vPars) and virtual machines (HPVM). The apparent benefit is that you can configure systems that have a mix of dedicated and shared resources. For example, you can have in the same partition some disks directly attached to a partition like they would be in HP’s vPars, and other disks served through the VIO in a way similar to HPVM. In reality, though, the difference between the way IBM partitions and HPVM virtual machines work is not that big, and if anything, is going to diminish over time.
A more puzzling statement is that The scalability is also restricted to the nPartition that the vPar is created on. Is IBM really presenting as a limitation the fact that you can’t put more than 16 CPUs in a partition on a 16-CPU system? So I tested that idea. And to my surprise, IBM does indeed allow you to create for example a system with 4 virtual CPUs on a host with 2 physical CPUs. Interestingly, I know how to create such configurations with HPVM, but I also know that they invariably run like crap when put under heavy stress, so HPVM will refuse to run them if you don’t twist its arm with some special “force” option. I believe that it’s the correct thing to do. My quick experiments with IBM systems in such configurations confirms my opinion.
IBM also comments that There is also limited workload support; resources cannot be added or removed. Resources can be added or removed to a vPar, or moved between vPars, so I’m not sure what IBM refers to. Similarly, when IBM writes: Finally, vPars don’t allow you to share resources between partitions, nor can you dynamically allocate processing resources between partitions, I invite readers to check the transcript of a 3-years old demo to see for how long (at least) that statement has been false…
HPVM now supports 8 virtual CPUs, not 4, and it has always been able to take advantage of the much large numbers of physical CPUs HP-UX supports (128 today). Please note that 8 virtual CPUs is what is supported, not what works (the red line in the graph shows 16-way scalability of an HPVM virtual machine under moderate competition using a Linux scalability benchmark; the top blue line is the scalability on native hardware; the bottom curves are HPVM under maximal competition and Xen under no competition.)
IBM also states that Reboots are also required to add processors or memory. , but this is actually not very different from their own micropartitions. In a Linux partition with a kernel supporting CPU hotplug, for example, you can disable CPU3 under HPVM by typing echo 0 > /sys/devices/system/cpu/cpu3/online, just like you would on a real system (on HP-UX, you would use hpvmmgmt -c 3 to keep only 3 CPUs). HPVM also features dynamic memory in a virtual machine, meaning that you can add or remove memory while the guest is running. You need to reboot a virtual machine only to change the maximum number of CPUs or the maximum amount of memory, but then that’s also the case on IBM systems: to change the maximum number of CPUs, you need to modify a “profile”, but to apply that profile, you need to reboot. However, IBM micro-partitions have a real advantage (probably not for long) which is that you can boot with less than the maximum, whereas HPVM requires that the maximum resources be available at boot time.
IBM wants you to believe that There is no support for features such as uncapped partitions or shared processor pools. In reality, HPVM has always worked in uncapped mode by default. Until recently, you needed HP’s Global Workload Manager to enable capping, because we thought that it was a little too cumbersome to configure manually. Our customers told us otherwise, so you can now directly create virtual machines with capped CPU allocation. As for shared processor pools, I think that this is an IBM configuration concept that is entirely unnecessary on HP systems, as HPVM computes CPU placement dynamically for you. Let me know if you think otherwise…
The statement IBM made that Virtual storage adapters also cannot be moved, unless the virtual machines are shut down. is also untrue. All you need is to use hpvmmodify to add or remove virtual I/O devices. There are limits. For example, HPVM only provisions a limited number of spare PCI slots and busses, based on how many devices you started with. So if you start with one disk controller, you might have an issue going to the maximum supported configuration of 158 I/O devices without rebooting the partition. But if you know ahead of time that you will need that kind of growth, we even provide “null” backing stores that can be used for that purpose.
But the core argument IBM has against HP Virtual Machines is: The downside here is scalability. Scalability is a complex enough topic that it deserves a separate post, with some actual data…
Preserving the investment matters
Let me conclude by pointing out an important fact: HP has a much better track record at preserving the investment of their customers. For example, on cell-based systems, HP allowed mixed PA-RISC and Itanium cells to facilitate the transition.
This is particularly true for virtual machines. HPVM is software, that HP supports on any Integrity server running the required HP-UX (including systems by other vendors). By contrast, IBM micropartitions are firmware, with a lot of hardware dependency. Why does it matter? Because firmware only applies to recent hardware, whereas software can be designed to run on older hardware. As an illustration, my development and test environments consists in severely outdated zx2000 and rx5670 (the old 900MHz version). These machines were introduced in 2002 and discontinued in 2003 or 2004. In other words, I develop HPVM on hardware that was discontinued before HPVM was even introduced…
By contrast, you cannot run IBM micro-partitions on any POWER-4 system, and you need new POWER-6 systems in order to take advantage of new features such as Live Partition Mobility (the ability to transfer virtual machines from one host to another). HP customers who purchased hardware five years ago can use it to evaluate the equivalent HPVM feature today, without having to purchase brand new hardware.
Update: About history
The IBM paper boasts about IBM’s 40-plus year history of virtualization, trying to suggest that they have been in the virtualization market longer than anybody else. In reality, the PowerVM solution is really recent (less than 5 years old). Earlier partitioning solutions (LPARs) were comparable to HP’s vPars, and both technologies were introduced in 2001.
But a friend and colleague at HP pointed out that a lot of these more modern partitioning technologies actually originated at Digital Equipment, under the name OpenVMS Galaxy. And the Galaxy implementation actually offered features that have not entirely been matched yet, like APIs to share memory between partitions.
Following my CGO talk, I went to visit my colleagues in Nashua, New-Hampshire.
Nashua ZKO site shutting down
The so-called ZKO building is a historical landmark in the history of DEC. For the old-timers in computer science, @zko.dec.com was a pretty healthy thing to have in your e-mail address… Its walls are layered with pictures of early VMS luminaries. The Integrity VM project was fortunate to inherit some of this historical expertise that DEC (don’t say “Compaq” to these folks…) had built in operating systems. We have in the team people who not only know what TOPS-10 and TOPS-20 were, but actually wrote or maintained parts of it.
But after so many years, Hewlett-Packard will be closing ZKO soon, and transferring people to another HP building in Marlborough. As a result, many of my colleagues will have to move to new offices, and others decided to work from home. So I thought this was a good time to visit and meet the team in one place, while this was still possible.
Working remotely is hard
I have strong reservations about working from home or remotely, however. This is much harder than corporations seem to think. I’ve been doing it occasionally for almost 5 years now. Of course, part of the problem is that the HP VPN is sized for people doing PowerPoint and Outlook e-mail. It’s easily overloaded when too many people at once try to do something interactive, like a Remote Desktop or VNC session. So whenever I try to work from home, two days out of three, I end up returning to the office to be able to work in decent conditions. I hope that our team will not be impacted the same way.
There is also a whole dimension of face-to-face interaction that no amount of phone conversation or chat or e-mail can compensate for. Face to face, you become friends. Over e-mail, you are at best acquaintances. This is something that the proponents of teleworking fail to realize: just how many friends did you make over e-mail? how many over a beer, or a coffee, or just chatting face to face?
Another factor in my case is time zone differences. When the Nashua team starts working, it’s 3PM in France. When the Cupertino team starts working, it’s 6PM. I have regular meetings after 6PM two or three days a week, and a weekly meeting on Thursdays that finishes at 9PM and is killing me. No amount of technology really helps with that.
In any case, I was so happy to spend some quality time with the team. I met for the first time several people I had been working with for years! This was a treat.
Today, I gave a talk about Integrity Virtual Machines at the EPIC-7 workshop of the CGO-2008 conference. The audience was smaller than I hoped, but quite interesting. This was a good occasion to make new contacts, and to renew contacts with old friends. After 7 years working on Integrity VM, this was actually the first time I talked publicly about it outside of HP (although Todd Kjos gave a talk at Gelato in April 2007).
There were a number of other interesting talks. Rohit Bhatia presented Tukwila, the next generation Itanium. I did not learn much during this talk, but it was interesting to see the Intel take on this processor (I usually only hear the HP side of the story). This somewhat renewed my confidence about this platform. It’s a very interesting processor for a programmer, but that alone has never been a guarantee of success. Clearly, the initial expectations for Itanium were set a little bit too high, but it’s interesting to see that Intel doesn’t give up. And hearing them boast about Tukwila as “the fastest processor in the industry” opens some perspectives.
One thing that I’m still wondering about is how we can explain customers why Itanium will stick at relatively modest number of cores (e.g. four), whereas Larrabee will feature a much larger number (16 or 24 according to Wikipedia). Hmmm…
There has to be a last guy.
Yesterday, the last French soldier in World War I, Lazare Ponticelli, died at age 110.
In a totally different context, I find myself feeling a little bit like the “last guy standing” today. Todd Kjos found some new technical challenge to tackle and decided to move on. He was one of the three fathers of HPVM, and his contributions were invaluable.
Nothing lasts forever. We knew this was coming. Still, this is a sad day. What else to expect from a 13th… But it was a privilege working with guys like Todd and Jonathan Ross.
It’s not often that I post about my work, in part because this is a personal blog, in part because I don’t want to leak any information that would benefit our competitors, or be incorrectly interpreted by our customers.
But the graph on the right is an exception, because I was really happy to get that result. I am not really at liberty to tell what the graph represents, except that it’s a performance benchmark testing the behavior of HP Integrity VM and comparing it against a native machine and two variants of a competiting product generally praised (including inside HP) for its superior performance. Well, Integrity VM is the set of reddish curves, which means we behave pretty well, and that makes me quite happy. Hey, sometimes killing hype with hard data is all it takes to make my day…
Money money money
The talk is interesting from a historical perspective. You get Bill Gates’ personal perspective on the history. Like “software is where we belong”, because once it works, it keeps working. The reasoning is really: can we split between the easy money (software, where once it’s done, it’s pretty much like printing money) that we keep to ourselves, and complicated business, like making the chips or the computers themselves, where Microsoft will just “influence” other companies…
… and Bill Gates invented the Internet
If you listen to him, they really designed what is now known as the “IBM PC”, and IBM’s contribution to the whole thing is barely evoked. In this talk, he also says “I decided to reserve the top 384K, put video memory here, I/O there, and so on”… I wonder if this is really true, it might very well be. He definitely had the technical ability to do that. What I wonder is if IBM would just have taken Bill Gates’ blueprintes as is, or whether this was really a dialogue. It’s hard to tell, but there are things in this discourse that can be discounted based on what we know today, like “this MS/DOS software we had invented“.
32-bit will last us more than 10 years
Looking towards the future into what is now our past, Bill Gates makes a few questionable predictions, like “a 32-bit address space is going to last us more than 10 years. In reality, the first 64-bit CPUs from MIPS appeared in 1991, less than 2 years later. But then today, in 2007, we are only starting to see the 4GB limit as a limit for personal computers, and the transition to 64-bit is well underway. This time, it was prepared ahead of time.
Small teams are good
There are some remarkable insights as well. For example, at one point, Gates explains that the individual pieces of software are not valuable to Microsoft, because in a few years, “who cares”, there will be better software. So he sees the key asset of Microsoft as being the ability to keep small teams focused and efficient, so that they remain nimble and ahead of the competition. That is so very smart. I would bet that practically nobody understood that so clearly in 1989, and even today, big companies (including Microsoft itself, ironically), often forget this key characteristic of our industry.
To use an example close to home, the core functionality of HP Integrity VM was developed by only three people, myself included, over the span of 3 years. Better yet, it was a quite pleasant experience, and each of us commented that this was the best team he ever had worked in. As a matter of fact, all the successful projects I participated in started as “small things”, something that a single person can do by himself. At one point later in the project, the thing grows big enough that you can parallelize the work, and then it makes sense to build a larger team (and boy, did we ask, when this happened!). But even at that point, it still works much better if the larger team is made of “loosely connected subteams”. I see it as a key strength of open-source software: individual contributors are not forced on a schedule that does not match their own needs, they jump on the bandwagon of whatever release train suits them best. In my opinion, this makes a world of difference.
Microsoft recently published a video of their upcoming Longhorn virtualization. Interestingly, they demonstrate a feature that they claim no competitor can match, namely an 8-core virtual machine. I’m really curious to see the final product. Being in the field, I know exactly what kind of problem virtual machines run into with scalability.
A key problem is relatively simple to explain, and having explained it to many HP customers, I see no reason not to talk about it on this blog. When you schedule more than one virtual CPU, there are essentially two ways of doing it.
- You can use something called gang scheduling. In that case, all virtual CPUs run at the same time. One major benefit is that if a virtual CPU is waiting for a lock, the virtual CPU that holds the lock is also running. So there is no major increase in lock contention, a key factor in scalability. The major drawback is that if you have an 8-way virtual machine with 1 CPU busy, it actually consumes 8 CPUs. That is not very good use of the CPUs.
- An alternative is to schedule each virtual CPU individually. This is much more difficult to get right, in particular because of the lock contention issue outlined above. On the other hand, the benefit in terms of CPU utilization for virtual machines that use only some of their virtual CPUs are enormous.
So here is a test I’m interested in running on Longhorn virtualization when it becomes available: if I have two 4-way virtual machines on an 4-way hardware, and if I start one “spinner” process in each that counts as fast as possible, do I see:
- 4 processors pegged at 100%, with the two spinners running at half their nominal rate, since they each get essentially 50% of one CPU? As far as I know, this is the behavior for VMware, and it’s characteristic of gang scheduling.
- 2 processors pegged at 100% and 2 processors sitting mostly idle, with each spinner getting 100% of a CPU? This is the behavior of Integrity Virtual Machines.
The bottom line is: there is a serious trade off between scalability and efficient use of resources. If Microsoft managed to solve that problem, kudos to them.