Archive for the ‘Virtualization’ Category

Optimize!

Is it normal to wait for your computer? Why should I wait 5 seconds when I click on a menu? Why does it sometimes take half a minute to open a new document? Developers, optimize your code, if only as a matter of public service! What about making it a New Year resolution?

Why is my Mac laptop slower than my iPad?

Apple cares about iPad performance

I have a serious issue with the fact that on a laptop with 8GB of RAM, a 1TB hard disk and a quad-core 2GHz i7, I spend my time waiting. All the time. For long, horribly annoying pauses.

Just typing these few paragraphs had Safari go into “pause” twice. I type something and it takes ten seconds or so with nothing showing up on screen, and then it catches up. Whaaaaat? How did programmers manage to write code so horribly that a computer with a quad-core 2.6GHz i7 can’t even keep up with my typing? Seriously? The Apple II, with its glorious 1MHz 8-bit 6502 never had trouble keeping up, no matter how fast I typed. Nor did Snow Leopard, for that matter…

Even today, why is it that I always find myself waiting for my Mac as soon as I have 5 to 10 applications open, when a poor iPad always feels responsive even with 20 or 30 applications open at the same time? Aren’t we talking about the same company (Apple)? About the same core operating system (Darwin being the core of both iOS and OSX)? So what’s the difference?

The difference, folks, is optimizations. Code for iOS is tuned, tight, fit. Applications are programmed with severe hardware limitations in mind. The iPad, for instance, is very good at “pausing” applications that you are not using and recalling them quickly when you switch to them. Also, most applications are very careful in their use of resources, in particular memory and storage. Apple definitely cares about the performance of the iPad. There was a time the performance of the Mac mattered as well, but that was a long time ago.

Boiled frog syndrome: we slowly got used to desktops or laptops being slower than tablets, but it’s just plain stupid.

Lion and Mountain Lion are Dog Slow

It’s obvious why they called it Lion…

I’ve been running every single version of MacOSX since the Rhapsody days. Up until Snow Leopard, each release was a definite improvement over the previous version. Lion and Mountain Lion, on the other hand, were a severe step backwards…

Lion and Mountain Lion were not just loaded with features I didn’t care about (like crippling my address book with Facebook email addresses); they didn’t just break features I relied on daily (like full-screen applications that work with multiple monitors, or RSS feeds). They were slow.

We are not talking about small-scale slowness here. We are talking about molasses-fed slugs caught in a tar pit, about lag raised to an art form, about junk code piling up at an industrial scale, about inefficiency that makes Soviet car design look good in comparison.

And it’s not just me. My wife and my kids keep complaining that “the machine lags”. And it’s been the case with every single machine I “upgraded” to Lion or Mountain Lion. To the point where I’m not upgrading my other machines anymore.

In my experience, the core issue is memory management. OSX Lion and Mountain Lion are much worse than their predecessors at handling multiple programs. On OSX, the primary rule of optimization seems to be “grab 1GB of memory first, ask questions later.” That makes sense if you are alone: RAM is faster than disk, by orders of magnitude, so copying stuff there is a good idea if you use it frequently.

But if you share the RAM with other apps, you may push those other apps out of memory, a process called “paging”. Paging depends very largely on heuristics, and it has a major impact on performance. Because, you see, RAM is faster than disk, by orders of magnitude. And now, this plays against you.

Here is an example of a heuristic that I believe was introduced in Lion: the OS apparently puts aside programs that you have not been using for a long while. A bit like an iPad, I guess. On the surface, this seems like a good idea. If you are not using them, free some memory for other programs. But this means that if I go away from my laptop and the screen saver kicks in, it will eat all available RAM and push other programs out. When I log back in… I have 3GB of free RAM and a spinning beach ball. Every time. And even if the screensaver does not run, other things like backupd (the backup daemon) or Spotlight surely will use a few gigabytes for, you know, copying files, indexing them, stuff.

Boiled frog syndrome: we slowly got used to programs using thousands of Mac 128Ks’ worth of memory to do simple things like running a screensaver. It’s preposterous.

Tuning memory management is very hard

Virtual Memory is complicated

The VM subsystem, responsible for memory management, was never particularly good in OSX. I remember a meeting with an Apple executive back when OSX was still called Rhapsody. Apple engineers were all excited about the new memory management, which was admittedly an improvement over MacOS9.

I told the Apple person I met that I could crash his Mac with 2 minutes at the keyboard, doing only things a normal user could do (i.e. no Terminal…) He laughed at me, gave me his keyboard and refused to even save documents. Foolish, that.

I went to the ancestor of Preview.app, opened a document, and clicked on “Zoom” repeatedly until the zoom factor reached 6400% or so. See, in those days, the application apparently allocated a rendering buffer that grew as you zoomed. The machine crawled to a halt as it started paging these gigabytes in and out just to draw the preview on the screen. “It’s dead, Jim”, time to reboot with a long, hard and somewhat angry press on the Power button.

That particular problem was fixed, but not the underlying issue, which is a philosophical decision to take control away from users in the name of “simplicity”. OS9 allowed me to say that an app was supposed to use 8MB of RAM. OSX does not. I wish I could say: “Screen Saver can use 256MB of RAM. If it wants more, have it page to disk, not the other apps.” If there is a way to do that, I have not found it.

Boiled frog syndrome: software vendors have slowly accustomed us to giving away control. But lack of control is not a feature.

Faster machines are not faster

A 1986 Mac beats a 2007 PC

One issue with poor optimizations is that faster machines, with much faster CPUs, GPUs and hard disks, are not actually faster to perform the tasks the user expects from them, because they are burdened with much higher loads. It’s as if developers always stopped at the limit of what the machine can do.

It actually makes business sense, because you get the most out of your machine. But it also means it’s easy to push the machine right over the edge. And more to the point, an original 1986 Mac Plus running its contemporary software will complete everyday tasks faster than a 2007 machine running 2007 software. I bet this would still hold in 2013.

So if you have been brainwashed by “Premature optimization is the root of all evil”, you should forget that. Optimizing is good. Optimize where it matters. Always. Or, as a colleague of mine once put it, “belated pessimization is the leaf of no good.”

Boiled frog syndrome: we have slowly been accustomed to our machines running inefficient code. But inefficiency is not a law of nature. Actually, in the natural world, inefficiency gets you killed. So…

Optimize!

LeMag IT article about HPVM

Apologies to my non-French readers: here is an article (in French) from LeMag IT about HPVM, which several colleagues forwarded to me.

This is also probably the first time someone links to the Taodyne web site.

Welcome to the readers of LeMag IT who decided to follow the link to my blog…

Stability and innovation

Two days ago, I attended a conference in Paris on the future of virtualization in mission-critical environments. There was a presentation from Intel about the roadmap for Itanium and virtualization.

Stability vs. Innovation

Two things in this presentation reminded me of what Martin Fink calls the Unix paradox:

  1. Intel pointed out that Itanium targets mission-critical computing, so they tend to be more conservative with it. For example, they use manufacturing processes for Itanium that have already been proven on the x86 side. Just as with Unix, there is a similar paradox for mission-critical processors.
  2. The cost equation is very different for mission-critical systems. For commodity hardware, the acquisition cost tends to dominate your thinking; for mission-critical hardware, it’s the (potential) cost of losing the system that drives you. So whereas in volume systems, you are ready to pay more for innovation, e.g. features or performance, in mission-critical systems, it’s for stability that you pay more.

Following the recent discussion with Eric Raymond, I thought that was yet another interesting angle on what “innovation” means to different people.

One step at a time, or “TIC-TAC-TOC”

Intel also reminded us of the TIC-TOC model they now use to release CPUs:

  • TIC: change the process on a stable micro-architecture
  • TOC: change the micro-architecture on a stable process

I think that a similar approach applies to how our customers want to upgrade their mission-critical software, something that I would call TIC-TAC-TOC:

  • TIC: Change the infrastructure (e.g. machines, disks), keep OS and applications the same
  • TAC: Change the applications, keep infrastructure and OS the same
  • TOC: Change the OS, keep infrastructure and applications the same

Customers may, at their discretion, decide to do multiple steps at the same time. For example, they may use an infrastructure change as an opportunity to also upgrade their OS and applications. But as a vendor, we should be careful not to force them to de-stabilize more than one thing at once. It should be their choice, not ours.

When innovation is the problem

Historically, HP has been good at this. TIC: you update from PA-RISC to Itanium, and you can still run the same OS, still run your PA-RISC applications. TAC: upgrade the applications, keeping everything else the same, and you get a healthy speed boost. TOC: upgrade to 11iv3, another speed boost; install HP Integrity Virtual Machines and you get the latest in virtualization features, even on 2002-vintage Itanium hardware. As far as I know, you can’t virtualize a POWER4, and you can’t get Live Partition Mobility on a POWER5 system.

But TIC-TAC-TOC is not a perfect solution. That model is painless for customers only if we can convince them to stay reasonably current in two out of three dimensions at any given time. It breaks down for a customer who runs HP-UX 10.20 on PA-RISC with obsolete applications. Such customers feel left behind, and the leap of faith needed to move to current technology is so big that they are easy prey for competitors.

So here is my interpretation of Martin Fink’s Unix paradox:

Stability + Innovation = Disruption

How to solve that equation is left as an exercise for the reader :-)

Cool high-level virtualization video

A good explanation of what my work is about:

Categories: Virtualization

Virtual machines and scalability

I already pointed out many problems regarding the comparison of virtual machines recently posted by IBM. But there is one topic which I thought required a separate post, namely scalability.

What is scalability?

Simply put, scalability is the ability to take advantage of having more CPUs, more memory, more disk, more bandwidth. If I put two CPUs to the task, do I get twice the power? Not in general. As Fred Brooks said, no matter how many women you put to the task, it still takes nine months to make a baby. On the other hand, with nine women, you can make nine babies in nine months. In computer terminology, we would say that making babies is a task where throughput (the total amount of work per unit of time) scales well with the number of baby-carrying women, whereas latency (the time it takes to complete a single request) does not.



Computer scalability is very similar. Different applications will care about bandwidth or latency in different ways. For example, if you connect to Google Maps, latency is the time it takes for Google to show the map, but in that case (unlike for pregnant women), it is presumably improved because Google Maps sends many small chunks of the map in parallel.

I have already written in an earlier post why I believe HP has a good track record with respect to partitioning and scalability.

Scalability of virtual machines

However, IBM has very harsh words against HP Integrity Virtual Machines (aka HPVM), and describes HPVM scalability as a “downside” of the product:

The downside here is scalability. With HP’s virtual machines, there is a 4 CPU limitation and RAM limitation of 64GB. Reboots are also required to add processors or memory. There is no support for features such as uncapped partitions or shared processor pools. Finally, it’s important to note that HP PA RISC servers are not supported; only Integrity servers are supported. Virtual storage adapters also cannot be moved, unless the virtual machines are shut down. You also cannot dedicate processing resources to single partitions.

I already pointed out factual errors in every sentence of this paragraph. But scalability is a more subtle problem, and it takes more explaining just to describe what the problems are, not to mention possible solutions… What matters is not just the performance of a single virtual machine when nothing else is running on the system. You also care about performance under competition, about fairness and balance between workloads, about response time to changes in demand.

The problem is that these are all contradictory goals. You cannot increase the performance of one virtual machine without taking something away from the others. Obviously, the CPU time that you give to one VM cannot be given to another one at the same time. Similarly, increasing the reactivity to fast-changing workloads also increases the risk of instability, as for any feedback loop. Finally, in a server, there is generally no privileged workload, which makes the selection of the “correct” answers harder to make than for workstation virtualization products.

Checkmark features vs. usefulness

Delivering good VM performance is a complex problem. It is not just a matter of lining up virtual CPUs. HPVM implements various safeguards to help ensure that a VM configuration will not just run, but run well. I don’t have as much experience with IBM micro-partitions, but it seems much easier to create configurations that are inefficient by construction. What IBM calls a “downside” of HPVM is, I believe, a feature.

Here is a very simple example. On a system with, say, 4 physical CPUs, HPVM will warn you if you try to configure more than 4 virtual CPUs:

bash-2.05b# hpvmmodify -P vm7 -c 8
HPVM guest vm7 configuration problems:
    Warning 1: The number of guest VCPUS exceeds server's physical cpus.
    Warning 2: Insufficient cpu resource for guest.
These problems may prevent HPVM guest vm7 from starting.
hpvmmodify: The modification process is continuing.


It seems like a sensible thing to do. After all, if you only have 4 physical CPUs, you will not get more power by adding more virtual CPUs. There is, however, a good chance that you will get less performance, in any case where one virtual CPU waits on another. Why? Because you increased the chances that the virtual CPU you are waiting on is not actually running at the time you request its help, independently of the synchronization mechanism that you are using. So instead of getting a response in a matter of microseconds (the typical wait time for, say, spinlocks), you will get it in a matter of milliseconds (the typical time slice on most modern systems).

Now, the virtual machine monitor might be able to do something smart about some of the synchronization mechanisms (notably kernel-level ones). But there are just too many ways to synchronize threads in user-space. In other words, by configuring more virtual CPUs than physical CPUs, you simply increased the chances of performing sub-optimally. How is that a good idea?
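If you want to see the cost of a missing virtual CPU for yourself, here is the kind of micro-test I have in mind; it is a hypothetical sketch (the file name, loop count and build line are mine, not from any of the benchmarks discussed here). Two threads hand a token back and forth through a user-space busy-wait, and the program reports the average cost of a hand-off. On native hardware or on a sanely configured guest, a hand-off costs on the order of a microsecond; when the partner’s virtual CPU is not actually running, it costs a scheduler time slice instead.

/* pingpong.c: two threads pass a token through a user-space busy-wait.
   A sketch to illustrate the cost of synchronization when the partner's
   (virtual) CPU may not be running.
   Build with: cc -std=c11 -O2 pingpong.c -lpthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/time.h>

#define HANDOFFS 1000000

static atomic_int turn;              /* 0: main thread's turn, 1: partner's */

static void *partner(void *arg)
{
    (void)arg;
    for (int i = 0; i < HANDOFFS; i++) {
        while (atomic_load(&turn) != 1)
            ;                        /* spin, just like a user-space spinlock */
        atomic_store(&turn, 0);      /* hand the token back */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timeval t0, t1;

    pthread_create(&t, NULL, partner, NULL);
    gettimeofday(&t0, NULL);
    for (int i = 0; i < HANDOFFS; i++) {
        while (atomic_load(&turn) != 0)
            ;                        /* wait for the partner to hand it back */
        atomic_store(&turn, 1);      /* hand the token over */
    }
    pthread_join(t, NULL);
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.2f microseconds per hand-off\n", us / HANDOFFS);
    return 0;
}

Run one copy per pair of virtual CPUs and compare an idle host with a busy one; the difference you will see is exactly the effect described above.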

IBM doesn’t seem to agree with me. First, in their article, they complain about HP vPars implementing a similar restriction: The scalability is also restricted to the nPartition that the vPar is created on. Also, the IBM user interface lets you create micro-partitions that have too many virtual CPUs with nary a peep. You can create a micro-partition with 16 virtual CPUs on a 2-way host, as illustrated below. Actually, 16 virtual CPUs is almost the maximum on a two-way host for another reason: there is a minimum of 0.1 physical CPU per virtual CPU in the IBM technology, and 16 * 0.1 is 1.6, which only leaves a meager 0.4 CPU for the virtual I/O server.

(Screenshots: IBM host configuration, and creating a micro-partition with too many virtual CPUs)

The problem is that no matter how I look at it, I can’t imagine how it would be a good thing to run 16 virtual CPUs on a 2-CPU system. To me, this approach sounds a lot like the Fantasia school of scalability. If you remember, in that movie, Mickey Mouse plays a sorcerer’s apprentice who casts a spell so that his broom will do his chores in his stead. But things rapidly get wildly out of control. When Mickey tries to ax the brooms to stop the whole thing, each fragment rapidly grows back into a full-grown broom, and things go from bad to worse. CPUs, unfortunately, are not magic brooms: cutting a CPU in half will not magically make two full-size CPUs.

Performing well in degraded configurations

Now, I don’t want you to believe that I went all defensive because IBM found a clever way to do something that HPVM can’t do. Actually, even if HPVM warns you by default, you can still force it to start a guest in such a “stupid” configuration, using the -F switch of hpvmstart. And it’s not like HPVM systematically performs really badly in this case either.

For example, below are the build times for a Linux 2.6.26.5 kernel in a variety of configurations.

4-way guest running on a 4-way host, 5 jobs

[linux-2.6.26.5]# gmake -j5
real    5m25.544s
user    18m46.979s
sys     1m41.009s

8-way guest running on a 4-way host, 9 jobs

[linux-2.6.26.5]# time gmake -j9
real    5m38.680s
user    36m23.662s
sys     3m52.764s

8-way guest running on a 4-way host, 5 jobs

[linux-2.6.26.5]# time gmake -j5
real    5m35.500s
user    22m25.501s
sys     2m6.003s

As you can verify, the build time is almost exactly the same whether the guest has 4 or 8 virtual CPUs. As expected, the additional virtual CPUs do not bring any benefit. In this case, the degradation exists, but it is minor. It is however relatively easy to build cases where the degradation would be much larger. Another observation is that running only enough jobs to keep 4 virtual CPUs busy actually improves performance: less time is spent by the virtual CPUs waiting on one another.

So, why do we even test such configurations or allow them to run, then? Well, there is always the possibility that a CPU goes bad, in which case the host operating system is most likely to disable it. When that happens, we may end up running with an insufficient number of CPUs. Even so, this is no reason to kill the guest. We still want to perform as well as we can, until the failed CPU is replaced with a good one.

In short, I think that HPVM is doing the right thing by telling you if you are about to do something that will not be efficient. However, in case you found yourself in that situation due to some unplanned event, such as a hardware failure, it still does the hard work to keep you up and running with the best possible performance.

Remaining balanced and fair

There is another important point to consider regarding the performance of virtual machines. You don’t just want virtual machines to perform well; you also care a lot about maintaining balance between the various workloads, both inside the virtual machine itself and between virtual machines. This is actually very relevant to scalability, because multi-threaded or multi-processor software often scales worse when some CPUs run markedly slower than others.

Consider for example that you have 4 CPUs, and divide a task into four approximately equal-sized chunks. The task will only complete when all 4 sub-tasks are done. If one CPU is significantly slower, all other CPUs will have to wait for it. In some cases, such as ray-tracing, it may be easy enough for another CPU to pick up some of the extra work. For other more complicated algorithms, however, the cost of partitioning may be significant, and it may not pay off to re-partition the task in flight. And even when re-partitioning on the fly is possible, software is unlikely to have implemented it if it did not bring any benefit on non-virtual hardware.
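To make that last point concrete, here is a minimal sketch of such a statically partitioned task (hypothetical; the file name and work count are made up). The work is cut into four equal chunks, one per thread, and the program is only done when the slowest chunk is done, so a single slow virtual CPU sets the pace for the whole task.

/* chunks.c: split a job into 4 equal, statically partitioned chunks.
   The job completes only when the slowest chunk completes, so one slow
   (virtual) CPU slows down the whole task.
   Build with: cc -O2 chunks.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define CHUNKS 4
#define WORK_PER_CHUNK 200000000UL

static void *worker(void *arg)
{
    long id = (long)arg;
    volatile unsigned long sum = 0;        /* volatile: keep the loop honest */
    for (unsigned long i = 0; i < WORK_PER_CHUNK; i++)
        sum += i;                          /* stand-in for real work */
    printf("chunk %ld done\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[CHUNKS];

    for (long i = 0; i < CHUNKS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* We are done only when every chunk is done: the joins below return
       when the slowest worker finishes, however fast the others were. */
    for (long i = 0; i < CHUNKS; i++)
        pthread_join(threads[i], NULL);

    puts("task complete");
    return 0;
}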

Loading virtual machines little by little…

In order to get a better feeling for all these points, readers are invited to do the following very simple experiment with their favorite virtual machine technology. To maximize chances that you can run the experiment yourself, I will not base my experiment on some 128-way machine with 1TB of memory running 200 16-way virtual machines or anything über-expensive like that. Instead, I will consider the simplest of all cases involving SMP guests: two virtual machines VM1 and VM2, each with two virtual CPUs, running concurrently on a 2-CPU machine. What could possibly go wrong with that? Nowadays, this is something you can try on most laptops…

The experiment is equally simple. We will load the virtual machines with more and more jobs, and see what happens. When I ran the experiment, I used a simple CPU spinner program written in C that counts how many loops per second it can perform (a sketch of such a spinner appears right after the diagram below). The baseline, which I will refer to as “100%”, is the number of iterations that the program makes on a virtual machine, with the other virtual machine sitting idle. This is illustrated below, with Process 1 running in VM1, colored in orange.

CPU 1         CPU 2
Process 1     Idle
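For readers who want to try this at home, a spinner along the following lines is enough; this is a sketch of the idea, not the exact program I used. It burns CPU and prints, roughly once per second, how many iterations it managed; that number, measured with the other virtual machine idle, is the 100% baseline.

/* spinner.c: burn CPU and report how many loops per second we get.
   A sketch of the kind of CPU spinner used for the experiment,
   not the exact program. Build with: cc -O2 spinner.c */
#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile unsigned long count = 0;   /* volatile: don't optimize the loop away */
    time_t start = time(NULL);

    for (;;) {
        for (int i = 0; i < 1000000; i++)
            count++;
        time_t now = time(NULL);
        if (now > start) {              /* report (roughly) once per second */
            printf("%lu loops/s\n", count / (unsigned long)(now - start));
            count = 0;
            start = now;
        }
    }
    /* not reached */
}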

Now, let’s say that you start another identical process in VM2. The ideal situation is that one virtual CPU for each virtual machine gets loaded at 100%, so that each process gets a 100% score. In other words, each physical CPU is dedicated to running a virtual CPU, but the virtual CPUs belong to different virtual machines. The sum of the scores is 200%, which is the maximum you can get on the machine, and the average is 100%. This is both optimal and fair. As far as I can tell, both HPVM and IBM micro-partitions implement this behavior. This is illustrated below, with VM1 in orange and VM2 in green.

CPU 1         CPU 2
Process 1     Process 2

However, this behavior is not the only choice. Versions of VMware up to version 3 used a mechanism called co-scheduling, where all virtual CPUs must run together. As the document linked above shows, VMware was boasting about that technique, but the result was that as soon as one virtual CPU was busy, the other physical CPU had to be reserved as well. As a result, in our little experiment, each process would get 50% of its baseline, not 100%. This approach is fair, but hardly optimal, since you waste half of the available CPU power. VMware apparently chose that approach to avoid dealing with the more complicated cases where one virtual CPU would wait for another virtual CPU that was not running at the time.

CPU 1         CPU 2
Process 1     Idle
Idle          Process 2

Now, let’s fire a second process in VM1. This is where things get interesting. In that situation, VM1 has both its virtual CPUs busy, but VM2 has only one virtual CPU busy, the other being idle. There are many choices here. One is to schedule the two CPUs of VM1, then the two CPUs of VM2 (even if one is idle). This method is fair between the two virtual machines, but it reserves a physical CPU for an idle virtual CPU half of the time. As a result, all processes will get a score of 50%. This is fair, but suboptimal, since you get a total score of 150% when you could get 200%.

CPU 1         CPU 2
Process 1     Process 3
Process 2     Idle

In order to optimize things, you have to take advantage of that ‘idle’ spot, but that creates imbalance. For example, you may want to allocate CPU resources as follows:

CPU 1         CPU 2
Process 1     Process 2
Process 1     Process 3

This scenario is optimal, since the total CPU bandwidth available is 200%, but it is not fair: process 1 now gets twice as much CPU bandwidth as processes 2 and 3. In the worst case, the guest operating system may end up being confused by what is going on. So one solution is to balance things out over longer periods of time:

CPU 1         CPU 2
Process 1     Process 2
Process 1     Process 3
Process 3     Process 2

This solution is again optimal and fair: processes 1, 2 and 3 each get 66% of a CPU, for a total of 200%. But other important performance considerations come into play. One is that we can no longer keep each process on a single CPU. Keeping processes bound to a given CPU improves cache and TLB usage. But in this example, at least one of the processes will have to jump from one CPU to the other, even if the guest operating system thinks that it is bound to a single CPU.

Another big downside as far as scalability is concerned involves inter-process communication. If processes 1 and 3 want to talk to one another in VM1, they can do so without waiting only half of the time, since during the other half, the other CPU is actually running a process that belongs to another virtual machine. A consequence is that the latency of this inter-process communication increases very significantly. As far as I can tell, this particular problem is the primary issue with the scalability of virtual machines. VMware tried to address it with co-scheduling, but we have seen why it is not a perfect solution. Statistically speaking, adding virtual CPUs increases the chances that the CPU you need will not be running, in particular when other virtual machines are competing for CPU time.

Actual scalability vs. simple metrics

This class of problems is the primary reason why HPVM limits the size of virtual machines. It is not that it doesn’t work. There are even workloads that scale really well, Linux kernel builds or ray-tracing being good examples. But even workloads that scale OK with a single virtual machine will no longer scale as well under competition. Again, virtual machine scalability is nowhere as simple as “add more virtual CPUs and it works”.

This is not just theory. We have tested the theory. Consider the graph below, which shows the results of the same benchmark run into a variety of configurations. The top blue line, which is almost straight, is perfect scalability, which you practically get on native hardware. The red line is HPVM under light competition, i.e. with other virtual machines running but mostly idle. In that case, HPVM scales well up to 16-way. The blue line below is HPVM under heavy competition. If memory serves me right, the purple line is fully-virtualized Xen… under no competition.

In other words, if HPVM supports 8 virtual CPUs today, it is because we believe that we can maintain useful scalability on demanding workloads and even under competition. We know, for having tested and measured it, that going beyond 8-way will usually not result in increased performance, only in increasing CPU churn.

One picture is worth 2¹⁰ words

As we have shown, making the right decisions for virtual machines is not simple. Interestingly, even the very simple experiment I just described highlights important differences between various virtual machine technologies. After launching 10 processes on each guest, here is the performance of the various processes when running under HPVM. In that case, the guest is a Linux RedHat 4 server competing against an HP-UX partition running the same kind of workload. You can see the Linux scheduler granting time to all processes almost exactly fairly, although there is some noise. I suspect that this noise is the result of the feedback loop that Linux puts in place to ensure fairness between processes.

By contrast, here is how AIX 6.1 performs when running the same workload. As far as I can tell, IBM implements what looks like a much simpler algorithm, probably without any feedback loop. It’s possible that there is an option to enable fair share scheduling on AIX (I am much less familiar with that system, obviously). The clear benefit is that it is very stable over time. The downside is that it seems quite a bit unfair compared to what we see in Linux. Some processes are getting a lot more CPU time than others, and this over an extended period of time (the graph spans about 5 minutes).

The result shown in the graphs is actually a combination of the effect of the operating system and virtual machine schedulers. In the case of IBM servers, I must say that I’m not entirely sure about how the partition and process schedulers interact. I’m not even sure that they are distinct: partitions seem to behave much like processes in AIX. In the case of HPVM, you can see the effect of the host HP-UX scheduler on the total amount allocated to the virtual machine.

Conclusion

Naturally, this is a very simple test, not even a realistic benchmark. It is not intended to demonstrate the superiority of one approach over the other. Instead, it demonstrates that virtual machine scalability and performance is not as simple as counting how many CPUs your virtual machine software can allocate. There are large numbers of complicated trade-offs, and what works for one class of workloads might not work so well for others.

I would not be surprised if someone shows benchmarks where IBM scales better than HPVM. Actually, it should be expected: after all, micro-partitions are almost built into the processor to start with; the operating system has been rewritten to delegate a number of tasks to the firmware; AIX really runs paravirtualized. HPVM, on the other hand, is full virtualization, which implies higher virtualization costs. It doesn’t get any help from Linux or Windows, and only very limited help from HP-UX. So if anything, I expect IBM micro-partitions to scale better than HPVM.

Yet I must say that my experience has not confirmed that expectation. In the few tests I made, differences, if any, were in HPVM’s favor. Therefore, my recommendation is to benchmark your application and see which solution performs best for you. Don’t let others tell you what the result should be, not even me…

IBM’s comparison of virtual machines…

IBM just posted a comparison of virtual machines that I find annoyingly flawed. Fortunately, discussion on OSnews showed that technical people don’t buy this kind of “data”. Still, I thought that there was some value in pointing out some of the problems with IBM’s paper.

nPartitions are tougher than logical partitions

What HP calls ‘nPartitions’ are electrically isolated partitions. IBM correctly indicates that nPartitions offer true electrical isolation. The key benefit, using IBM’s own words, is that nPartitions allow you to service one partition while others are online. You can for example extract a cell, replace or upgrade CPUs or memory, and re-insert the cell without downtime.

The problem is when IBM states that this is similar to IBM logical partitioning. In reality, electrically-isolated partitions are tougher. Like on IBM systems, CPUs can be replaced in case of failure. But you can replace entire cells without shutting down the system, contrary to IBM’s claim that systems require a reboot when moving cells from one partition to another.

Finally, IBM remarks that Another downside is that entry-level servers do not support this technology, only HP9000 and Integrity High End and Midrange servers. Highly redundant, electrically isolated partitions have a cost, so they don’t make as much sense on entry-level systems. But it’s not a downside for HP that its entry-level systems only offer partitioning technologies similar to IBM’s; it’s a downside for IBM that it has nothing similar to nPartitions on its high-end servers.

Integrity servers support more operating systems

IBM writes: It’s important to note that while nPartitions support HP-UX, Windows®, VMS, and Linux, they only do so on the Itanium processor, not on the HP9000 PA Risc architecture. It is true that HP (Microsoft, really) doesn’t support Windows on PA-RISC, a long obsolete architecture, but what is more relevant is that IBM doesn’t support Windows on POWER even today. This is the same kind of spin as for electrically-isolated partitions: IBM tries to divert attention from a missing feature in their product line by pointing out that there was a time when HP didn’t have the feature either.

IBM writes that Partition scalability also depends on the operating system running in the nPartition. Again, this is true… on any platform. But the obvious innuendo here is that the scalability of Integrity servers is inconsistent across operating systems. A quick visit to the TPC-H and TPC-C result pages will demonstrate that this is false. HP posts world-class results with HP-UX and Windows, and other Itanium vendors demonstrate the scalability of Linux on Itanium.

By contrast, the Power systems from IBM or Bull all run AIX. And if you want to talk about scalability, IBM as of today has no 30TB TPC-H results, and it often takes several IBM machines to compete with a single HP system.

Regarding the scalability of other operating systems, HP engineers put a lot of effort into Linux scalability early on. I would go as far as saying that 64-way scalability of Linux is probably something that HP engineers tested seriously before anybody else, and they got some recognition for this work.

HP offers both isolation and flexibility

IBM also tries to minimize the flexibility of hard partitions. If this is electrically isolated, it can’t be flexible, right? That’s exactly what IBM wants you to believe: They also do not support moving resources to and from other partitions without a reboot.

However, HP has found a clever way to provide both flexibility and electrical isolation. With HP’s Instant Capacity (iCap), you can move “right to use” CPUs around. On such a high-end machine, it is a good idea to have a few spare CPUs, either in case of CPU failure or in case of a sudden peak in demand. HP recognized that, so you can buy machines with CPUs installed, but disabled. You pay less for these CPUs, you do not pay software licenses, and so on.

The neat trick that iCap enables is to transfer a CPU from one nPartition to another without breaking electrical isolation: you simply shut down a CPU in one partition, which gives you the right to use one in another partition without paying anything for it. All this can be done without any downtime, and again, it preserves electrical isolation between the partitions.

Shared vs. dedicated

Regarding virtual partitions, IBM is quick to point out the “drawbacks”: What you can’t do with vPars is share resources, because there is no virtualized layer in which to manage the interface between the hardware and the operating systems. Of course, you cannot do that with dedicated resources on IBM systems either: that’s the point of having a partitioning system with dedicated resources! If you want shared resources, HP offers virtual machines, see below. So when IBM claims that This is one reason why performance overhead is limited, a feature that HP will market without discussing its clear limitations, the same also holds for dedicated resources on IBM systems.

What is true is that IBM has attempted to deliver in a single product what HP offers in two different products, virtual partitions (vPars) and virtual machines (HPVM). The apparent benefit is that you can configure systems that have a mix of dedicated and shared resources. For example, you can have in the same partition some disks directly attached to a partition like they would be in HP’s vPars, and other disks served through the VIO in a way similar to HPVM. In reality, though, the difference between the way IBM partitions and HPVM virtual machines work is not that big, and if anything, is going to diminish over time.

A more puzzling statement is that The scalability is also restricted to the nPartition that the vPar is created on. Is IBM really presenting as a limitation the fact that you can’t put more than 16 CPUs in a partition on a 16-CPU system? So I tested that idea. And to my surprise, IBM does indeed allow you to create, for example, a system with 4 virtual CPUs on a host with 2 physical CPUs. Interestingly, I know how to create such configurations with HPVM, but I also know that they invariably run like crap when put under heavy stress, so HPVM will refuse to run them unless you twist its arm with a special “force” option. I believe that it’s the correct thing to do. My quick experiments with IBM systems in such configurations confirm my opinion.

IBM also comments that There is also limited workload support; resources cannot be added or removed. Resources can be added to or removed from a vPar, or moved between vPars, so I’m not sure what IBM refers to. Similarly, when IBM writes: Finally, vPars don’t allow you to share resources between partitions, nor can you dynamically allocate processing resources between partitions, I invite readers to check the transcript of a 3-year-old demo to see for how long (at least) that statement has been false…

Virtual Machines

Finally, IBM also gives false or outdated information regarding HP Integrity Virtual Machines, aka HPVM (not “IVM”, which is an IBM product.)

HPVM now supports 8 virtual CPUs, not 4, and it has always been able to take advantage of the much larger number of physical CPUs HP-UX supports (128 today). Please note that 8 virtual CPUs is what is supported, not what works (the red line in the graph shows 16-way scalability of an HPVM virtual machine under moderate competition using a Linux scalability benchmark; the top blue line is the scalability on native hardware; the bottom curves are HPVM under maximal competition and Xen under no competition.)

IBM also states that Reboots are also required to add processors or memory, but this is actually not very different from their own micro-partitions. In a Linux partition with a kernel supporting CPU hotplug, for example, you can disable CPU3 under HPVM by typing echo 0 > /sys/devices/system/cpu/cpu3/online, just like you would on a real system (on HP-UX, you would use hpvmmgmt -c 3 to keep only 3 CPUs). HPVM also features dynamic memory in a virtual machine, meaning that you can add or remove memory while the guest is running. You need to reboot a virtual machine only to change the maximum number of CPUs or the maximum amount of memory, but then that’s also the case on IBM systems: to change the maximum number of CPUs, you need to modify a “profile”, and to apply that profile, you need to reboot. However, IBM micro-partitions have a real advantage (probably not for long), which is that you can boot with less than the maximum, whereas HPVM requires that the maximum resources be available at boot time.

IBM wants you to believe that There is no support for features such as uncapped partitions or shared processor pools. In reality, HPVM has always worked in uncapped mode by default. Until recently, you needed HP’s Global Workload Manager to enable capping, because we thought that it was a little too cumbersome to configure manually. Our customers told us otherwise, so you can now directly create virtual machines with capped CPU allocation. As for shared processor pools, I think that this is an IBM configuration concept that is entirely unnecessary on HP systems, as HPVM computes CPU placement dynamically for you. Let me know if you think otherwise…

IBM’s statement that Virtual storage adapters also cannot be moved, unless the virtual machines are shut down is also untrue. All you need is to use hpvmmodify to add or remove virtual I/O devices. There are limits. For example, HPVM only provisions a limited number of spare PCI slots and buses, based on how many devices you started with. So if you start with one disk controller, you might have an issue growing to the maximum supported configuration of 158 I/O devices without rebooting the partition. But if you know ahead of time that you will need that kind of growth, we even provide “null” backing stores that can be used for that purpose.

But the core argument IBM has against HP Virtual Machines is: The downside here is scalability. Scalability is a complex enough topic that it deserves a separate post, with some actual data…

Preserving the investment matters

Let me conclude by pointing out an important fact: HP has a much better track record at preserving the investment of their customers. For example, on cell-based systems, HP allowed mixed PA-RISC and Itanium cells to facilitate the transition.

This is particularly true for virtual machines. HPVM is software that HP supports on any Integrity server running the required HP-UX version (including systems from other vendors). By contrast, IBM micro-partitions are firmware, with a lot of hardware dependencies. Why does it matter? Because firmware only applies to recent hardware, whereas software can be designed to run on older hardware. As an illustration, my development and test environment consists of a severely outdated zx2000 and an rx5670 (the old 900MHz version). These machines were introduced in 2002 and discontinued in 2003 or 2004. In other words, I develop HPVM on hardware that was discontinued before HPVM was even introduced…

By contrast, you cannot run IBM micro-partitions on any POWER4 system, and you need new POWER6 systems in order to take advantage of new features such as Live Partition Mobility (the ability to transfer virtual machines from one host to another). HP customers who purchased hardware five years ago can use it to evaluate the equivalent HPVM feature today, without having to purchase brand new hardware.

Update: About history

The IBM paper boasts about IBM’s 40-plus year history of virtualization, trying to suggest that they have been in the virtualization market longer than anybody else. In reality, the PowerVM solution is really recent (less than 5 years old). Earlier partitioning solutions (LPARs) were comparable to HP’s vPars, and both technologies were introduced in 2001.

But a friend and colleague at HP pointed out that a lot of these more modern partitioning technologies actually originated at Digital Equipment, under the name OpenVMS Galaxy. And the Galaxy implementation actually offered features that have not entirely been matched yet, like APIs to share memory between partitions.

Multiple operating systems on a single tomato

If you want to learn something about virtualization, this video is not necessarily the most extensive technical explanation you might get, but it’s short and it’s funny. And Debbie is really our program manager.

There are many other videos here

Categories: Funny, Virtualization

Oh, the joy of not being remote…

Following my CGO talk, I went to visit my colleagues in Nashua, New Hampshire.

Nashua ZKO site shutting down

The so-called ZKO building is a landmark in the history of DEC. For the old-timers in computer science, @zko.dec.com was a pretty prestigious thing to have in your e-mail address… Its walls are layered with pictures of early VMS luminaries. The Integrity VM project was fortunate to inherit some of the historical operating-system expertise that DEC (don’t say “Compaq” to these folks…) had built up. We have on the team people who not only know what TOPS-10 and TOPS-20 were, but actually wrote or maintained parts of them.

But after so many years, Hewlett-Packard will soon be closing ZKO and transferring people to another HP building in Marlborough. As a result, many of my colleagues will have to move to new offices, and others have decided to work from home. So I thought this was a good time to visit and meet the team in one place, while it was still possible.

Working remotely is hard

I have strong reservations about working from home or remotely, however. This is much harder than corporations seem to think. I’ve been doing it occasionally for almost 5 years now. Of course, part of the problem is that the HP VPN is sized for people doing PowerPoint and Outlook e-mail. It’s easily overloaded when too many people at once try to do something interactive, like a Remote Desktop or VNC session. So whenever I try to work from home, two days out of three, I end up returning to the office to be able to work in decent conditions. I hope that our team will not be impacted the same way.

There is also a whole dimension of face-to-face interaction that no amount of phone conversation, chat or e-mail can compensate for. Face to face, you become friends. Over e-mail, you are at best acquaintances. This is something that the proponents of teleworking fail to realize: just how many friends did you make over e-mail? How many over a beer, or a coffee, or just chatting face to face?

Another factor in my case is time zone differences. When the Nashua team starts working, it’s 3PM in France. When the Cupertino team starts working, it’s 6PM. I have regular meetings after 6PM two or three days a week, and a weekly meeting on Thursdays that finishes at 9PM and is killing me. No amount of technology really helps with that.

In any case, I was so happy to spend some quality time with the team. I met for the first time several people I had been working with for years! This was a treat.

Categories: HP Integrity VM

CGO / EPIC-7

Today, I gave a talk about Integrity Virtual Machines at the EPIC-7 workshop of the CGO-2008 conference. The audience was smaller than I hoped, but quite interesting. This was a good occasion to make new contacts, and to renew contacts with old friends. After 7 years working on Integrity VM, this was actually the first time I talked publicly about it outside of HP (although Todd Kjos gave a talk at Gelato in April 2007).

There were a number of other interesting talks. Rohit Bhatia presented Tukwila, the next generation Itanium. I did not learn much during this talk, but it was interesting to see Intel’s take on this processor (I usually only hear the HP side of the story). It somewhat renewed my confidence in the platform. It’s a very interesting processor for a programmer, but that alone has never been a guarantee of success. Clearly, the initial expectations for Itanium were set a little too high, but it’s interesting to see that Intel is not giving up. And hearing them boast about Tukwila as “the fastest processor in the industry” opens up some interesting prospects.

One thing that I’m still wondering about is how we can explain to customers why Itanium will stick to a relatively modest number of cores (e.g. four), whereas Larrabee will feature a much larger number (16 or 24, according to Wikipedia). Hmmm…

Categories: HP Integrity VM

Hyper-V: Linux or not Linux?

A recent column by Mitchell Ashley argues that Microsoft’s upcoming Hyper-V virtualization platform (formerly known as Viridian) “leaves Linux out in the cold”, because it only supports SuSE Linux and not bigger contenders like RedHat and Ubuntu.

I believe that Mitchell Ashley misses two important points in his analysis:

  • The US market, where RedHat and Ubuntu dominate, is not the market where Microsoft has the most trouble with Windows. In Europe and Asia, their dominance is not as clear. SuSE is a key player in Europe, in particular in Germany, and these are also the locations where governments threaten to standardize on “open platforms”. So instead of focusing on markets where it has little to gain, Microsoft may be after markets where Windows is threatened.
  • The relative cost of software has gone up tremendously, and it now represents the majority of the purchase cost of any IT infrastructure. Long gone are the days when IBM simply gave the software away when you purchased its hardware. So Microsoft may be playing catch-up, but as long as they can offer deals you can’t refuse on Microsoft Windows licenses (e.g. it’s much cheaper to run 4 Windows VMs under Hyper-V than under VMware), they can tie rocks to the other guys’ ankles…

On a different topic, one of the comments suggests that SuSE only runs thanks to a binary-only kernel module. That would be interesting if it is indeed the case. While binary kernel modules have been used for specific proprietary hardware such as 3D graphics cards, I don’t think it has ever been the case before that you needed one for the kernel itself.

If it’s some kind of paravirtualization or acceleration as I suspect (another comment about someone running other kernels tends to confirm that viewpoint), then it’s a bit different. But if you need some proprietary binary simply to run Linux, I believe that this will cause some backlash from the Free Software community.

Categories: Microsoft, Virtualization