Archive for the ‘Programming’ Category

Optimize!

Is it normal to wait for your computer? Why should I wait 5 seconds when I click on a menu? Why does it sometimes take half a minute to open a new document? Developers, optimize your code, if only as a matter of public service! What about making it a New Year resolution?

Why is my Mac laptop slower than my iPad?

Apple cares about iPad performance

I have a serious issue with the fact that on a laptop with 8G of RAM, 1TB of hard disk, a quad-core 2GHz i7, I spend my time waiting. All the time. For long, horribly annoying pauses.

Just typing these few paragraphs had Safari go into “pause” twice. I type something and it takes ten seconds or so with nothing showing up on screen, and then it catches up. Whaaaaat? How did programmers manage to write code so horribly that a computer with a quad-core 2.6GHz i7 can’t even keep up with my typing? Seriously? The Apple II, with its glorious 1MHz 8-bit 6502 never had trouble keeping up, no matter how fast I typed. Nor did Snow Leopard, for that matter…

Even today, why is it that I always find myself waiting for my Mac as soon as I have 5 to 10 applications open, when a poor iPad always feels responsive even with 20 or 30 applications open at the same time? Aren’t we talking about the same company (Apple)? About the same core operating system (Darwin being the core of both iOS and OSX)? So what’s the difference?

The difference, folks, is optimizations. Code for iOS is tuned, tight, fit. Applications are programmed with severe hardware limitations in mind. The iPad, for instance, is very good at “pausing” applications that you are not using and recalling them quickly when you switch to them. Also, most applications are very careful in their use of resources, in particular memory and storage. Apple definitely cares about the performance of the iPad. There was a time the performance of the Mac mattered as well, but that was a long time ago.

Boiled frog syndrome: we slowly got used to desktops or laptops being slower than tablets, but it’s just plain stupid.

Lion and Mountain Lion are Dog Slow

It’s obvious why they called it Lion…

I’ve been running every single version of MacOSX since the Rhapsody days. Up until Snow Leopard, each release was a definite improvement over the previous version. Lion and Mountain Lion, on the other hand, were a severe step backwards…

Lion and Mountain Lion were not just loaded with features I didn’t care about (like crippling my address book with Facebook email addresses), they didn’t just break features I relied on on a daily basis (like full screen applications that work with multiple monitors, or RSS feeds). They were slow.

We are not talking about small-scale slowness here. We are talking about molasses-fed slugs caught in a tar pit, of lag raised to an art form, of junk code piling up at an industrial scale, of inefficiency that makes Soviet car design look good in comparison.

And it’s not just me. My wife and my kids keep complaining that “the machine lags”. And it’s been the case with every single machine I “upgraded” to Lion or Mountain Lion. To the point where I’m not upgrading my other machines anymore.

In my experience, the core issue is memory management. OSX Lion and Mountain Lion are much worse than their predecessors at handling multiple programs. On OSX, the primary rule of optimization seems to be “grab 1GB of memory first, ask questions later.” That makes sense if you are alone: RAM is faster than disk, by orders of magnitude, so copying stuff there is a good idea if you use it frequently.

But if you share the RAM with other apps, you may push those other apps out of memory, a process called “paging”. Paging depends largely on heuristics, and has a major impact on performance. Because, you see, RAM is faster than disk, by orders of magnitude. And now, this plays against you.

Here is an example of a heuristic that I believe was introduced in Lion: the OS apparently puts aside programs that you have not been using for a long while. A bit like an iPad, I guess. On the surface, this seems like a good idea. If you are not using them, free some memory for other programs. But this means that if I go away from my laptop and the screen saver kicks in, it will eat all available RAM and push other programs out. When I log back in… I have 3GB of free RAM and a spinning beach ball. Every time. And even if the screensaver does not run, other things like backupd (the backup daemon) or Spotlight surely will use a few gigabytes for, you know, copying files, indexing them, stuff.

Boiled frog syndrome: we slowly got used to programs using thousands of Mac 128Ks’ worth of memory to do simple things like running a screensaver. It’s preposterous.

Tuning memory management is very hard

Virtual Memory is complicated

The VM subsystem, responsible for memory management, was never particularly good in OSX. I remember a meeting with an Apple executive back when OSX was still called Rhapsody. Apple engineers were all excited about the new memory management, which was admittedly an improvement over MacOS9.

I told the Apple person I met that I could crash his Mac within two minutes at the keyboard, doing only things a normal user could do (i.e. no Terminal…). He laughed at me, gave me his keyboard and refused to even save documents. Foolish, that.

I went to the ancestor of Preview.app, opened a document, and clicked on “Zoom” repeatedly until the zoom factor was about 6400% or so. See, in those days, the application was apparently allocating a rendering buffer that grew as you zoomed. The machine crawled to a halt as it started paging these gigabytes in and out just to draw the preview on the screen. “It’s dead, Jim”, time to reboot with a long, hard and somewhat angry press on the Power button.

That particular problem was fixed, but not the underlying issue, which is a philosophical decision to take control away from users in the name of “simplicity”. OS9 allowed me to say that an App was supposed to use 8M of RAM. OSX does not. I wish I could say: “Screen Saver can use 256M of RAM. If it wants more, have it page to disk, not the other apps.” If there is a way to do that, I have not found it.
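For illustration only, here is roughly what such a knob would look like if the standard POSIX resource limits actually did the job. They don’t: RLIMIT_RSS is essentially ignored on OSX and modern Linux, which is part of my complaint.

// Illustration only: a per-application cap on resident memory, as it
// could look with the POSIX setrlimit() API. In practice RLIMIT_RSS is
// not enforced on OSX or modern Linux, so this is wishful thinking.
#include <sys/resource.h>
#include <cstdio>

int main()
{
    struct rlimit cap;
    cap.rlim_cur = 256 * 1024 * 1024;    // soft limit: 256MB of resident RAM
    cap.rlim_max = 256 * 1024 * 1024;    // hard limit: 256MB

    if (setrlimit(RLIMIT_RSS, &cap) != 0)
        perror("setrlimit");
    return 0;
}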

Boiled frog syndrome: we have slowly been accustomed by software vendors to giving away control. But lack of control is not a feature.

Faster machines are not faster

A 1986 Mac beats a 2007 PC

One issue with poor optimizations is that faster machines, with much faster CPUs, GPUs and hard disks, are not actually faster to perform the tasks the user expects from them, because they are burdened with much higher loads. It’s as if developers always stopped at the limit of what the machine can do.

It actually makes business sense, because you get the most out of your machine. But it also means it’s easy to push the machine right over the edge. And more to the point, an original 1986 Mac Plus will run the programs designed for it faster than a 2007 machine runs the programs designed for it. I bet this would still hold in 2013.

So if you have been brainwashed by “Premature optimization is the root of all evil”, you should forget that. Optimizing is good. Optimize where it matters. Always. Or, as a colleague of mine once put it, “belated pessimization is the leaf of no good.”

Boiled frog syndrome: we have slowly been accustomed to our machines running inefficient code. But inefficiency is not a law of nature. Actually, in the natural world, inefficiency gets you killed. So…

Optimize!

Taodyne – Tao Presentations documents: execution and drawing model

This article explains how Tao Presentations transforms a document (.ddd) into images on the screen. It describes an interesting mechanism that makes it easy to create dynamic documents depending on events such as time or mouse movements. Moreover, this technique also allows Tao Presentations to optimize the rendering of graphic elements, enabling smooth 60Hz rendering even with complex content.

Taodyne – Tao Presentations documents: execution and drawing model.

Meta: Talk about LLVM

I gave a short talk about LLVM today. The link to the talk is tao://git.taodyne.com/examples/LLVM (sorry, can’t make it a hyperlink, as WordPress removes the tao:// part…). To watch this link, you will need Tao Presentations, which itself uses LLVM for the rendering.

Rendering of the LLVM talk with LLVM within Tao Presentations

This is not the first meta-talk made with Tao Presentations, but for some reason, this one reminds me of the first time I presented HPVM from a Powerpoint within HPVM.

Taodyne’s best wishes – Source code version

I have released the source code for the “Season’s Greetings” from Taodyne. This can give you an idea of what XL can do today. In that video, which is rendered in real-time and in 3D, the XL program executes 60 times per second to synchronize the animations on the screen with the background movie.

In other XL-related news, I’ve been working on an updated language reference document. This describes where I want to bring XL, notably the type system and library. A lot of what is in this document is not implemented yet, but it will give the directions that I intend to follow in the coming months.

Designing a new programming language

I’ve been working for years on a programming language called XL. XL stands for “extensible language”. The idea is to create a programming language that starts simple and grows with you.

All the work of Taodyne, including our fancy 3D graphics, is based on XL. If you think that Impress.js is cool (and it definitely is), you should check out what Tao Presentations can do. It’s an incredible environment for quickly putting together stunning 3D presentations. And we are talking real 3D here, with or without glasses depending on your display technology. We can even play YouTube videos directly, as the demo below shows.

But the paradox is that the work for Taodyne keeps me so busy I’m no longer spending as much time on the core XL technology as I’d like. Fortunately, vacation periods are an ideal time to relax, and most importantly, think. So I guess it’s no coincidence that Christmas is one of the periods when XL leaps forward. Last year, for example, I implemented most of the type inference I wanted in XL, and as a result, got it to perform close to C on simple examples.

This year was no exception. I finally came up with a type system that I’m happy with, that is simple to explain, and that will address the demands of Tao Presentations (notably to implement easy-to-tweak styles). The real design insight was how to properly unify types, functions and data structures. Yep, in XL, types are “functions” now (or, to be precise, will be very soon, I need to code that stuff now).

Another interesting idea that came up is how to get rid of a lot of the underlying C++ infrastructure. Basically, symbol management for XLR is currently done with C++ data structures called “Symbols” and “Rewrites”. I found a way to do exactly the same thing with only standard XLR data structures, which will allow XLR programs to tweak their own symbol tables if they choose to do so.

An interesting rule of programming: you know you have done something right when it allows you to get rid of a lot of code while improving the capabilities of the software.

Back to coding.

In defense of Euclideon’s Unlimited Details claims

There’s been a bit of a controversy recently regarding Euclideon’s “Unlimited Details” 3D rendering technology:

The video that caused all this stir is here:

This looks pretty good. Yet the video shows a number of limitations, like repeated patterns at high scale. These limitations may explain why John Carmack, Notch and others say that Euclideon is only implementing well-known algorithms and scamming people for funding.

Is it new? Is it real?

As for me, I wouldn’t call it a scam, and I don’t think the technology is that far out. Euclideon’s technology is, in my opinion, a little more than just sparse voxel octrees (SVOs), though the general principles are most likely similar. That being said, I must agree with Notch that Euclideon is anything but above hyperlatives, something I discussed recently.

To more directly address the criticisms being made by Notch, what they have done looks like a real technical innovation worth funding. For example, they deal with shadows and illumination, something that is hard to do with SVOs. They demonstrated some basic animation. They don’t have to tell all of us the gory technical details and limitations, as long as they tell their investors. And from the interview, it seems that they had people do some due diligence.

Regarding Carmack’s tweet, it does not dismiss the technology. It’s public knowledge that Carmack is himself working on similar stuff. He said this:

In our current game title we are looking at shipping on two DVDs, and we are generating hundreds of gigs of data in our development before we work on compressing it down.  It’s interesting that if you look at representing this data in this particular sparse voxel octree format it winds up even being a more efficient way to store the 2D data as well as the 3D geometry data, because you don’t have packing and bordering issues.  So we have incredibly high numbers; billions of triangles of data that you store in a very efficient manner.  Now what is different about this versus a conventional ray tracing architecture is that it is a specialized data structure that you can ray trace into quite efficiently and that data structure brings you some significant benefits that you wouldn’t get from a triangular structure.  It would be 50 or 100 times more data if you stored it out in a triangular mesh, which you couldn’t actually do in practice.

Carmack and Notch have high stakes in the game. To them, a technology like Euclideon’s is threatening, even if mildly. Dismissing something new because we are experts in the old ways is so easy to do. But let’s not get used to it.

3D rendering before GPUs

What legitimacy do I have to give my opinion on this topic, you may ask? I wrote a game called Alpha Waves a couple of years before the game John Carmack claims was “the first 3D PC game ever!”. That game is in the Guinness Book as “The first 3D platform game” (not the first 3D game, since there were several much simpler 3D games before).

More importantly, Alpha Waves was as far as I know the first 3D game offering interaction with all 3D objects in the environment: you could touch them, bounce off them, etc. Also new for the time, the game was 6-axis (all rotations were possible), fullscreen (most games at the time showed 3D in a quarter screen or half screen), and showed 20 to 50 3D objects simultaneously (the standard back then was 1 to 5). Games like Carmack’s Hovertank 3D had none of these capabilities. This blog post explains how this was achieved.

In short, I have been thinking about 3D rendering for a long time. I have written 3D rendering engines in assembly language at a time when there was no graphics card and no floating-point unit on the CPU. Back then, drawing a line or a circle was something you had to code yourself. Not having GPUs or 3D APIs whatsoever meant that you attacked the problem from all angles, not just as “a polygon mesh with texture mapping”.

(As an aside, I’m still doing 3D these days, but even if we are pushing the boundaries of what today’s graphics cards can do (e.g. they are not optimized for rendering text in 3D on auto-stereoscopic displays), that’s much less relevant to the discussion.)

Exploring 3D rendering algorithms

In the late 1980s, when many modern 3D rendering algorithms had yet to be invented, I continued exploring ways to render more realistic 3D, as did many programmers at the time. I tried a number of ideas, both on paper and as prototypes. I tried software texture mapping (a la Wolfenstein), which looked positively ugly with only 16 simultaneous colors on screen. I reinvented Bresenham’s line drawing algorithm, then generalized it to 2D shapes like circles or ellipses, and then tried to further generalize it to 3D planar surfaces and shapes, or to 3D rays of light.

This was how you proceeded back then. There was no Internet accessible to normal folks, no on-line catalog of standard algorithms or code. We had books, magazines, person-to-person chat, computer clubs. As a result, we spent more time inventing from scratch. I tried a number of outlandish algorithms that I have never read anything about, which is a good thing because they didn’t work at all.

More to the point, I also designed something that, on paper, looked a bit like what Unlimited Details is supposed to be today. At the time, it went nowhere because it exceeded the available memory capacity by at least three orders of magnitude. Of course, we are talking about a time when having 512K of RAM in your machine was pretty exciting, so three orders of magnitude is only 512M, not a big deal today. It is because of this old idea that I believe Euclideon’s claims have some merit.

What do we know about Unlimited Details?

Here are some facts you can gather about Unlimited Details technology. I will discuss below why I believe these observations teach us something about their technology.

  • The base resolution is 64 atoms per cubic millimeter. Like Notch, I thought: that’s 4×4×4. Please note that I write it’s the resolution, not the storage. I agree with Notch that they can’t be storing data about individual atoms, nor do they claim they do. It’s pretty obvious we see the same 3D objects again and again in their “world”.
  • They claim that their island is 1km x 1km. That means 4 million units along each direction, which is well within what a 32-bit integer coordinate can hold. So it’s not outlandish.
  • They claim repeatedly that they compute per pixel on screen. This is unlike typical 3D game rendering today, which computes per polygon, then renders polygons on screen. There are many per-pixel technologies, though: SVOs are one, but raytracing and others are also primarily per screen pixel.
  • There’s a lot of repetitive patterns in their video, particularly at large scale. For example, the land-water boundaries are very blockish. But I think this is not a limitation of the technology, just a result of them building the island quickly out of a logo picture. Below is an example that shows no such effect:

  • Notch points out that objects tend to all have the same direction, which he interprets as demonstrating that they use SVO, where rotations are hard to do.
  • On the contrary, close-ups of the ground or the picture above do not show such obvious pattern repetitions, and objects seem to have various orientations and, even more telling, to “intersect” or “touch” without obvious “cubic” boundaries. My impression is that at high-level, they organized objects in cubic blocks to create the demo, but that we shouldn’t infer too much from that.
  • Euclideon shows dynamic reflections on water surfaces. However, the only water surfaces we see are perfectly planar (no turbulence…).
  • They have some form of dynamic shadow computation, but it’s still in progress. They talk about it, but it’s also quite visible around 1:30 in the video above, or even better, in the picture below from their web site:

  • They do software rendering at 1024×768 at around 20fps “unoptimized”, single-threaded on a modern i7 CPU. They claim that they can improve this by a factor of 4.
  • When they zoom out of the island, the colors fade. I think that this is a rather telling clue about their renderer. I didn’t see any transparency or fog, however.
  • On a similar note, there is no visible “flicker” of individual pixels as they zoom in or out, as you’d expect if they really picked up an individual atom. The color of nearby atoms being different, either their algorithm is very stable with respect to the atom it picks (i.e. it is not solely based on the direction of an incoming ray), or they average things out with neighboring atoms.
  • Last, and probably the most important point: they know quite well what others are doing, they are quoting them, and they are pointing out their difference.

Voxels and Sparse Voxel Octrees?

There are a number of people working with voxels and sparse voxels octrees. You can search videos on YouTube. They all share a number of common traits:

  • They generally look blocky. The Euclideon videos look blocky at large scale, but all the blocks themselves remain highly detailed. You never see uniform squares or cubes appear as in most SVO demos, e.g. Voxelstein 3D below. This may be due to the 4 atoms-per-mm resolution. But see the related problem below.

  • It’s difficult to do nice looking shadows and lighting with SVOs. Look on the web to see what I mean. The counter examples I saw used some form of raytracing, non real-time rendering, or were GPU demos. The shadows and lighting in Euclideon’s demos look quite smooth.

But there’s one big issue with SVOs, which is traversal time (see this article for a good background). Consider the following numbers: Euclideon claims to draw 1024×768 pixels, 20 times per second. That’s roughly 16M pixels per second. On a 2GHz CPU, that leaves on the order of 100 cycles per pixel to find the object and compute the color of the pixel. In the best case, that’s roughly 500 instructions per pixel, but don’t forget that an ILP level of 5 is quite hard to sustain. A single cache miss can easily cost you 100 cycles.
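A quick back-of-the-envelope check of these numbers, using nothing but the figures quoted above:

// Back-of-the-envelope cycle budget, using only the figures quoted above
// (1024x768 at 20 fps, a 2 GHz CPU, an optimistic ILP of 5).
#include <cstdio>

int main()
{
    double pixelsPerSecond = 1024.0 * 768 * 20;          // ~15.7 million
    double cyclesPerPixel  = 2.0e9 / pixelsPerSecond;    // ~127 cycles
    double instrPerPixel   = cyclesPerPixel * 5;         // best case only
    printf("%.1fM pixels/s, %.0f cycles and at best %.0f instructions per pixel\n",
           pixelsPerSecond / 1e6, cyclesPerPixel, instrPerPixel);
    return 0;
}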

Memory considerations

And that’s really the big question I have with Euclideon’s claims. How do they minimize cache misses on a data set that is necessarily quite big? Consider the minimum amount of data you need for a sphere of radius R: the area of the sphere is 4πR².

Even if we have atom data only for the surface, a sphere of 1000 units would use roughly 12 million atom storage elements. It’s not a big sphere either: a unit being reportedly 0.25mm, that would be a sphere with a 25cm radius, 10 inches for imperialists. An object the size of the elephant (1m) takes roughly 200M atoms, and the big statues must be around 500M atoms. This explains why “memory compression” is something that Euclideon repeatedly talks about in their video. My guess here is that there aren’t that many kinds of atoms in the elephant, for example. I’m tempted to think they have “elements” like in the physics world.
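The same kind of back-of-the-envelope arithmetic for storage, assuming surface-only atoms and taking the 1m elephant as a 4000-unit radius (which seems to be how the ~200M figure was obtained):

// Surface-only atom counts for spheres, at 4 atoms per millimeter.
#include <cstdio>

int main()
{
    const double pi = 3.14159265358979;
    // One unit = 0.25mm: radius 1000 units = 25cm, radius 4000 units = 1m
    for (double radius : { 1000.0, 4000.0 })
        printf("radius %4.0f units -> %.1f million surface atoms\n",
               radius, 4 * pi * radius * radius / 1e6);
    return 0;
}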

But memory compression doesn’t solve cache latency issues. That, combined with the lack of spatial or temporal flicker, tells me that they found a really smart way for their “atom search” to be stable and to preserve good memory locality. My guess is that they do this based on a “radius” of the incoming ray. If the camera is far from the object being looked at, you only pick “large” atoms considered representative of the whole. Identifying a good candidate “average” atom is something you can do at object encoding time.

Number of instructions per pixel

Assume Euclideon solved memory latency so that, on average, they get, say, a good 200 instructions per pixel. We start with a resolution of 4 million units in the X, Y and Z directions, and we need to find the target voxel. So we start with (roughly) 22 bits along each axis. My guess is they may have less along the Z axis, but that doesn’t make a big difference. With an octree, they need, roughly, 22 steps through the octree to get down to the individual “atom”. That leaves 200/22, or roughly 10 machine instructions per octree traversal step. That’s really not a lot.
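To make that budget concrete, here is a minimal sketch of a naive octree descent (my own illustration, not anything we actually know about Euclideon’s code). Even this bare loop costs on the order of ten machine instructions per level once you count the shifts, masks, load, test and branch.

// Minimal sketch of a naive octree descent: at each level, one bit of
// X, Y and Z selects one of 8 children. Purely illustrative.
#include <cstdint>

struct Node
{
    Node    *child[8];          // null for empty sub-cubes
    uint32_t color;             // "average atom" for this whole cube
};

const Node *descend(const Node *root, uint32_t x, uint32_t y, uint32_t z)
{
    const Node *node = root;
    for (int level = 21; level >= 0; level--)   // 22 levels for 22-bit axes
    {
        unsigned index = ((x >> level) & 1)
                       | (((y >> level) & 1) << 1)
                       | (((z >> level) & 1) << 2);
        const Node *next = node->child[index];
        if (!next)
            break;              // empty space or coarsest stored detail
        node = next;
    }
    return node;                // deepest node found along this path
}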

Of course, in general, you don’t need to get down to the individual atoms. The whole viability of the octree scheme is largely based on the fact that you can stop at a much higher granularity. In reality, the size of the atoms you need to find is a function of the screen size, not the individual atom resolution. We are therefore looking at more like 10 iterations instead of 22. But still, that won’t leave us with more than about 20 instructions.

Ah, but wait, an SVO algorithm also requires a small amount of ray tracing, i.e. for each pixel, you need to check along the way if a ray starting at that pixel is hitting things. You need to repeat along the path of the ray until you hit something. And ray displacement computations also take time. The amount of time depends on the particular algorithm, but to my knowledge, the best scenario is roughly a log of the number of possible hits along the way (i.e. you do another binary search). So if we keep the same 22 binary searches along the way, that leaves us with less than one instruction per step on average.

Clearly, these numbers explain why there is some skepticism of Euclideon’s claims. The CPUs today are just short of being able to do this. Which is why the SVO demos we see require smaller worlds, lower resolution. All this reduces the number of bits significantly, making it much easier to achieve correct performance. But this analysis shows why a simple standard SVO algorithm is unlikely to be what Euclideon uses.

Just short = Optimizations are useful

In general, I tend to stick with the First Rule of Optimization: Don’t. But being “just short” is exactly the time when tiny optimizations can make the difference between “it works” and “it doesn’t”.

At the time of Alpha Waves, we were “just short” of being able to do 3D on a PC, because the mathematics of 3D required multiplications, which cost tens of cycles on the machines of the time. The tiny optimization was to replace 9 multiplications costing 14 to 54 cycles each with three additions costing 4 cycles each (on the 68K CPU of my Atari ST). I often needed multiple additions, but even so, that was less expensive. And that single optimization made the whole scheme viable: it made it possible for Alpha Waves to render 30 objects on screen when other programs of the time slowed down with 3.
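For the curious, here is a hedged sketch of the kind of trick this refers to, reconstructed from memory rather than from the original 68K assembly: once the rotated basis vectors of an object have been computed for the frame, every additional grid vertex costs three additions instead of a full 9-multiplication matrix-vector product.

// Illustrative forward-differencing sketch, not the original Alpha Waves
// code: corner and the two step vectors are the rotated origin and basis
// vectors for this frame; every other vertex needs only additions.
#include <cstdint>

struct Vec3 { int32_t x, y, z; };        // fixed-point coordinates

static inline Vec3 add(Vec3 a, Vec3 b)
{
    return { a.x + b.x, a.y + b.y, a.z + b.z };
}

void transformGrid(Vec3 corner, Vec3 stepX, Vec3 stepY, int W, int H, Vec3 *out)
{
    Vec3 row = corner;                   // rotated position of the grid corner
    for (int j = 0; j < H; j++)
    {
        Vec3 p = row;
        for (int i = 0; i < W; i++)
        {
            out[j * W + i] = p;          // store transformed vertex
            p = add(p, stepX);           // 3 additions per vertex
        }
        row = add(row, stepY);           // 3 additions per row
    }
}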

So what kind of optimizations can we think about in the present case? Here are some semi-random ideas:

  • Assuming 21 bits instead of 22, or making the Z axis smaller (i.e. you could have 24+24+16, where the Z axis would be “only” about 16 meters deep at 0.25mm per unit), you can make all three coordinates fit in a 64-bit value. So basically, each “atom” in the world is uniquely addressed by a 64-bit “address”. You can then rephrase the problem as “map a 64-bit value to a sparse data structure”.
  • You don’t need the order of bits in the “address” to be fixed. There are plenty of multimedia instructions for reshuffling bits these days. Among other things, you can work with X:Y:Z triples where X, Y and Z have 20 bits, 2 bits or 1 bit. For example, if X:Y:Z are packed bit by bit, a sequence of 3 bits can serve as an index to resolve a step of the octree traversal (see the bit-interleaving sketch after this list).
  • If there’s no transparency, you can draw a ray from the camera to the atom, and know that you will stop exactly once, all atoms before it being transparent. If you have a variable size for atoms, you can have very large transparent atoms so that advancing the ray will be pretty fast across transparent regions. You can practically get back a factor of 20 in the instructions per iteration.
  • Computing the ray displacement using a variation on Bresenham’s algorithm, you can basically do the “hard” computation once for 8 rays using symmetries, keeping only the “simple” computations in the main loop.
  • At first level, you can use memory mapping to let the hardware do the work for you. You can’t do this on 64-bits, mind you, since there are various per-OS limits on the largest range that you can map. But taking advantage of the TLB means a single instruction will do a 4-level table lookup with hardware caching. I’m pretty sure Euclideon people don’t do that (I’m suggesting it only because I did operating systems design and implementation for a little too long), but it would be cool if they did.
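To illustrate the first two bullets, here is my own sketch with assumed 21-bit coordinates (nothing here is known to be what Euclideon actually does): interleaving the bits of X, Y and Z into a single 64-bit “address” makes each group of 3 bits directly usable as the child index at one octree level.

// Interleave 21-bit X, Y, Z into one 64-bit Morton-style "address".
#include <cstdint>

static uint64_t spread3(uint64_t v)      // spread 21 bits to every 3rd bit
{
    v &= 0x1FFFFF;                       // keep 21 bits
    v = (v | (v << 32)) & 0x001F00000000FFFFULL;
    v = (v | (v << 16)) & 0x001F0000FF0000FFULL;
    v = (v | (v <<  8)) & 0x100F00F00F00F00FULL;
    v = (v | (v <<  4)) & 0x10C30C30C30C30C3ULL;
    v = (v | (v <<  2)) & 0x1249249249249249ULL;
    return v;
}

uint64_t atomAddress(uint32_t x, uint32_t y, uint32_t z)
{
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}

unsigned childIndex(uint64_t address, int level)  // level 0 is the root split
{
    return (address >> (3 * (20 - level))) & 7;
}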

There are obviously many more tricks required to achieve the objective. But I did a quick back of the envelope calculation on the average cost using some of the tricks above and a couple more, and I ended up with cycles to spare. So I believe it can be done on today’s hardware.

Conclusion

John Carmack is spot on when he says that productization is the real issue. Prototypes are easy, products are hard. But I wouldn’t rule out seeing a game using that technology on current generation systems.

Just my two bits.

Should your build systems be up to date?

A recent post on the mailing list for Tao Presentations beta got me thinking: should build systems be up to date?

The poster explains:

Regarding the ubuntu version, was it build with a g++ 4.6 ?
It seems to require a GLIBCXX_3.4.15 which is just above the one I have
on Natty.

I am going to install a g++ 4.6 from source in order to get the right
libstdc++ but I wanted to let you know just in case someone else get the
same issue.

So here is the problem: if we upgrade our build systems to the latest and greatest patch level, we will also include a number of recent dependencies that our customers may not have yet.

Some development environments provide options to select which build tools you use, allowing you to revert to older build configurations. For example, my MacOSX system shows the following variants of G++:

g++            g++-4.0        g++-apple-4.2  g++2
g++-3.3        g++-4.2        g++-mp-4.4     g++3

Unfortunately, that facility doesn’t automatically extend to all the libraries you depend on. For example, we use Qt to develop cross-platform GUI code, and versions before the still unreleased 4.8 version emit compilation warnings while compiling on MacOSX Lion. So if I want to build with Lion, my options are either to build in an unsupported configuration (and stop using the -Wall option in our builds), or to build with a not-yet-released library.

So is the best solution to keep your build systems behind, with all the implied security risks? Is there a better overall strategy I didn’t think of?
