During a meeting today, a colleague of mine shared the belief that exception handling has no impact on optimizations in modern C++. While we did everything we could 20+ years ago to ensure that all kinds of optimizations remained possible, there is a residual cost that you can trigger. (This is section 3.8 in the paper, if you are curious.)

Fibonacci computed using template instantiation

Let’s start with this simple C++ code:

template<unsigned N>
struct A
{
    A(): i() {}
    ~A() {}
    unsigned S() const { return i.S(); }
    struct I : A<N-1>, A<N-2>
    {
        unsigned S() const
        {
            return A<N-1>::S() + A<N-2>::S();
        }
    } i;
};

template<> struct A<0> { unsigned S() const { return 1; } };
template<> struct A<1> { unsigned S() const { return 1; } };

int main()
{
    A<SIZE> a;
    return a.S();
}

What this code is doing is asking the compiler to compute the Fibonacci sequence using template instantiation. The Fibonacci sequence was chosen because it requires recursive instantiation, and as such, is somewhat costly in terms of number of calls.

Quiz: Why do I need struct I?

If we compile this code and look at the generated assembly with -fno-exceptions, we get the following output:

% c++ -DSIZE=30 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && cat /tmp/glop.s
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0	sdk_version 12, 3
	.globl	_main                           ## -- Begin function main
	.p2align	4, 0x90
_main:                                  ## @main
	.cfi_startproc
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	movl	$1346269, %eax                  ## imm = 0x148ADD
	popq	%rbp
	retq
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols

So even for A<30>, the compiler managed to compute the constant value directly, and the generated code essentially puts that value, $1346269, into a register. The optimizations work quite well in that case.

If we count the lines in the output assembly, we get 18. We will use that as a metric from here on.

When optimizations work, the cost can be zero

Now, if you compile the same code with exceptions enabled, you get exactly the same result. So far, so good: the cost of exceptions is zero. Notably, the compiler could determine that it did not need to generate exception tables or landing pads.

Let us now do the same thing with a minor change, namely adding a constructor and a destructor to the inner class I that call two inline functions named construct() and destruct(), the rest of the code remaining the same:

inline void construct() {}
inline void destruct() {}

template<unsigned N>
struct A
{
[...]
    struct I : public A<N-1>, public A<N-2>
    {
        I() { construct(); }
        ~I() { destruct(); }
[...]
    } i;
};

Again, the generated code is exactly the same. The code size is still 18 lines. So far, so good. Still zero cost.

Optimizing side effect computations

Compilers are even smart enough nowadays to correctly evaluate side effects of the various calls on global variables. For example, consider this implementation of our little helper functions:

unsigned ctors = 0, dtors = 0;
inline void construct() { ctors++; }
inline void destruct() { dtors++; }

In that case, the compiler will be able to deduce how many times the functions would be called:

% c++ -DSIZE=30 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && cat /tmp/glop.s
[...]
	addl	$1346268, _ctors(%rip)          ## imm = 0x148ADC
	addl	$1346268, _dtors(%rip)          ## imm = 0x148ADC
	movl	$1346269, %eax                  ## imm = 0x148ADD

The size of our code is now 24 lines, so it has not markedly increased. Great job, compiler!

Hiding the functions

Let us now assume that we hide the function definitions:

extern void construct(), destruct();

The generated code in that case becomes noticeably longer:

% c++ -DSIZE=30 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && wc -l /tmp/glop.s
     932 /tmp/glop.s

So the code is now 932 lines long, roughly 51 times bigger than the initial output. In a sense, this is not overly surprising, since we added actual work to do, so the compiler generated a large number of real function calls.

Interestingly, the number of calls generated does not match the number of dynamic calls computed statically. For A<30>, there are 91 calls to construct and 91 calls to destruct (10 of each being tail calls). For A<40>, there are 126 calls to each, 15 of them tail calls.

Let’s make the compiler sweat

What we are doing with this kind of code is putting a lot of pressure on the poor compiler. The problem is that the compiler's effort is non-linear in the value of SIZE in the code above. As a matter of fact, if you try to increase the size, you quickly hit various compiler limits (the columns below are SIZE, compile time, assembly lines, and the number of generated calls to construct and destruct):

1  0.05 real 0.02 user 0.01 sys  18 0 0
2  0.04 real 0.02 user 0.01 sys  20 1 1
3  0.07 real 0.03 user 0.01 sys  22 2 2
4  0.05 real 0.03 user 0.01 sys  26 4 4
5  0.05 real 0.03 user 0.01 sys  32 7 7
6  0.05 real 0.03 user 0.01 sys  42 12 12
7  0.06 real 0.04 user 0.01 sys  58 20 20
8  0.06 real 0.04 user 0.01 sys  84 33 33
9  0.06 real 0.04 user 0.01 sys  132 34 34
10  0.07 real 0.05 user 0.01 sys  179 35 35
11  0.07 real 0.05 user 0.01 sys  199 37 37
12  0.08 real 0.05 user 0.01 sys  228 40 40
13  0.08 real 0.06 user 0.02 sys  285 45 45
14  0.08 real 0.06 user 0.01 sys  300 46 46
15  0.08 real 0.06 user 0.01 sys  336 52 52
16  0.09 real 0.06 user 0.01 sys  394 53 53
17  0.10 real 0.08 user 0.01 sys  446 54 54
18  0.09 real 0.07 user 0.01 sys  466 56 56
19  0.10 real 0.08 user 0.02 sys  497 59 59
20  0.10 real 0.08 user 0.02 sys  556 64 64
21  0.10 real 0.08 user 0.01 sys  571 65 65
22  0.11 real 0.08 user 0.01 sys  607 71 71
23  0.11 real 0.09 user 0.01 sys  665 72 72
24  0.11 real 0.09 user 0.02 sys  717 73 73
25  0.12 real 0.09 user 0.01 sys  737 75 75
26  0.12 real 0.10 user 0.01 sys  764 78 78
27  0.13 real 0.11 user 0.02 sys  823 83 83
28  0.14 real 0.11 user 0.01 sys  838 84 84
29  0.15 real 0.13 user 0.02 sys  874 90 90
30  0.16 real 0.13 user 0.02 sys  932 91 91
31  0.17 real 0.14 user 0.02 sys  984 92 92
32  0.19 real 0.17 user 0.01 sys  1004 94 94
33  0.22 real 0.20 user 0.02 sys  1031 97 97
34  0.27 real 0.24 user 0.02 sys  1090 102 102
35  0.35 real 0.32 user 0.02 sys  1105 103 103
36  0.46 real 0.43 user 0.01 sys  1141 109 109
37  0.65 real 0.62 user 0.02 sys  1199 110 110
38  0.96 real 0.93 user 0.02 sys  1251 111 111
39  1.45 real 1.42 user 0.02 sys  1271 113 113
40  2.25 real 2.20 user 0.03 sys  1298 116 116
41  3.57 real 3.51 user 0.04 sys  1357 121 121
42  5.94 real 5.81 user 0.08 sys  1372 122 122
43  9.52 real 9.33 user 0.11 sys  1408 128 128
44  15.16 real 14.83 user 0.16 sys  1466 129 129
45  23.99 real 23.62 user 0.22 sys  1518 130 130
46  40.89 real 39.67 user 0.37 sys  1542 132 132
47  62.51 real 61.13 user 0.51 sys  1578 135 135
48  /tmp/glop.cpp:27:5: warning: stack frame size (512559704) exceeds limit (4294967295) in function 'main' [-Wframe-larger-than]
int main()
    ^
1 warning generated.
103.61 real 101.16 user 0.95 sys  1645 140 140
49  /tmp/glop.cpp:27:5: warning: stack frame size (3483774824) exceeds limit (4294967295) in function 'main' [-Wframe-larger-than]
int main()
    ^
1 warning generated.
170.82 real 166.37 user 1.57 sys  1659 141 141
50  /tmp/glop.cpp:27:5: warning: stack frame size (3996334552) exceeds limit (4294967295) in function 'main' [-Wframe-larger-than]
int main()
    ^
1 warning generated.
266.99 real 262.95 user 2.29 sys  1715 147 147

In short, a modern compiler, on a machine with 16 GB of memory, starts having trouble compiling the code above for sizes of 48 and above. It takes close to 3 minutes to give me a result for size 49, and over 4 minutes for 50.

We make the compiler suffer because the cost of compiling that code is non-linear in SIZE.

This is true even though the number of calls and the size of the generated code grow roughly linearly (which, to be honest, was a surprise to me; it shows some really smart optimizations).

Exception handling tables

Let us now compile with -fexceptions. Here, the assembly code jumps to 5202 lines for A<30>. In other words, the code became about 5 times larger. If we plot the code size relative to the size of the problem, we get a roughly constant overhead factor.

This overhead corresponds to the amount of code that needs to be generated to deal with the various exit possibilities.

Code generation change

What is more interesting is to contrast the number of calls to construct and destruct.

In the no-exception case, the number of generated calls to construct and destruct is identical, and is plotted in blue on the graph. In the exception case, the number of calls to destruct in the generated code is higher than the number of calls to construct. Presumably, the reason is that there are multiple distinct exit paths when destroying a partially-constructed object (e.g., if an exception is thrown from destruct(), you end up calling terminate()).

What is even more important is what happens at the left of the graph. Notably, on the blue line, you see something that starts with what looks like quadratic behavior, followed by more linear growth. What this tells us is that the compiler places resource limits on its optimizations, something often called an optimization budget. For low enough values of SIZE, the compiler probably favors additional inlining, but beyond a certain limit, it has to disable inlining.

The obvious conclusion is that the code with and without exception handling is not identical. Once you reach the compiler optimization limits, the mere possibility of exceptions changes the generated code in an observable way.

Can we make it worse? Of course we can!

Let’s make the compiler suffer

See, so far, we were quite nice: we only gave a relatively simple code structure to the compiler. If you look at it, there is only one constructor for a given class, say A<10>, irrespective of where it appears in the computation of the Fibonacci sequence.

The reason for selecting Fibonacci initially, however, was to be able to prevent the compiler from doing that optimization. In other words, we can make the structure of the various classes used in the computation depend on the path followed during the computation. This can be done by embedding the actual computation path in the class name. For example, changing the code a bit, we can do:

extern "C" void construct(), destruct();

template <unsigned N, typename T = void>
struct A
{
    using D = A;
    A()                 { construct(); }
    ~A()                { destruct(); }
    A<N-1, D> l;
    A<N-2, D> r;
    unsigned S()        { return l.S() + r.S(); }
};

template<typename T> struct A<0, T> { unsigned S() { return 1; } };
template<typename T> struct A<1, T> { unsigned S() { return 1; } };

int main()
{
    A<SIZE> a;
    return a.S();
}

What we changed with this little update is that we now force the compiler to generate different constructors for all possible computation paths. Each constructor can throw an exception, so the landing pads and recovery code for that exception now need to be generated. And that applies recursively to all the objects inside.

Pretty quickly, the compiler gives up any idea of inlining, but even so, the number of calls to construct and destruct explodes, and the compiler now struggles with cases as small as SIZE 20:

1  0.15 real 0.08 user 0.05 sys  18 0 0
2  0.17 real 0.09 user 0.06 sys  70 1 1
3  0.20 real 0.11 user 0.06 sys  171 2 4
4  0.22 real 0.14 user 0.06 sys  461 4 12
5  0.26 real 0.17 user 0.06 sys  829 7 21
6  0.30 real 0.22 user 0.06 sys  1369 12 34
7  0.41 real 0.32 user 0.07 sys  2383 20 58
8  0.61 real 0.50 user 0.08 sys  3944 33 95
9  0.90 real 0.77 user 0.09 sys  6406 54 154
10  1.29 real 1.15 user 0.10 sys  10535 88 252
11  2.02 real 1.82 user 0.13 sys  17133 143 409
12  3.53 real 3.21 user 0.19 sys  27747 232 662
13  5.60 real 5.09 user 0.29 sys  45065 376 1074
14  9.03 real 8.22 user 0.44 sys  73004 609 1739
15  14.18 real 13.04 user 0.62 sys  118148 986 2814
16  21.17 real 19.28 user 0.84 sys  191337 1596 4556
17  30.51 real 28.37 user 1.17 sys  309677 2583 7373
18  43.44 real 40.48 user 1.49 sys  501095 4180 11930
19  58.55 real 55.71 user 1.70 sys  810959 6764 19306
20  178.14 real 157.27 user 7.28 sys  1312244 10945 31239
21  351.97 real 276.47 user 17.54 sys  2123276 17710 50546

The number of calls generated in the code completely dwarfs what we had in the previous example, and grows completely non-linearly. So does the overall code size.

In such a case, the difference in code size between compiling with and without exception handling becomes such a constraint that, beyond a given size, the compiler visibly changes strategy to limit the overhead of exception handling on the generated code.

In short, by purposefully building a pathological case, we can make exception handling impose an almost arbitrarily large overhead on the generated code, and as a result de-optimize it much earlier than would otherwise happen.

How bad can the de-optimization be?

Now, one might argue that this only impacts the exceptional path, and has little impact on the main optimization path. When I showed an earlier version of this article to my coworker, he essentially held that position. He notably wrote something that shows quite a mature understanding of what is required to deal with exceptions:

More specifically, there are no prologues or epilogues that all functions would execute all the time regardless of exceptions thrown or not as was the case before the zero-overhead implementations took over.

And, indeed, the whole point of the exception handling design we pioneered on IA-64, and the reason it became almost ubiquitous today, is that you can perform pretty neat forms of optimizations. That does not mean, however, that enabling exception handling cannot have a huge impact on the main path. To illustrate, let’s make the compiler grind a bit more for us.

After all, in order to generate exception-related work, it is sufficient to have a call to construct or destruct in the innermost class. That also simplifies the generated code enough to show the effect beautifully. So the code now looks like this:

extern "C" void construct(), destruct();

template <unsigned N, typename T = void>
struct A
{
    using D = A;
    A()  {}
    ~A() {}
    A<N-1, D> l;
    A<N-2, D> r;
    unsigned S()        { return l.S() + r.S(); }
};

template<typename T> struct A<0, T>
{
    A()  { construct(); }
    ~A() { destruct(); }
    unsigned S() { return 1; }
};
template<typename T> struct A<1, T> { unsigned S() { return 1; } };

int main()
{
    A<SIZE> a;
    return a.S();
}

The generated code for A<5> with exceptions disabled now looks like this, which is pretty neat and tidy:

% c++ -DSIZE=5 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && cat /tmp/glop.s
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0	sdk_version 12, 3
	.globl	_main                           ## -- Begin function main
	.p2align	4, 0x90
_main:                                  ## @main
	.cfi_startproc
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	callq	_construct
	callq	_construct
	callq	_construct
	callq	_destruct
	callq	_destruct
	callq	_destruct
	movl	$8, %eax
	popq	%rbp
	retq
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols

Let us now compile the exact same code with exception handling enabled. The main path no longer looks tidy at all. The main function alone now looks like this:

% c++ -DSIZE=5 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fexceptions && cat /tmp/glop.s
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0	sdk_version 12, 3
	.globl	_main                           ## -- Begin function main
	.p2align	4, 0x90
_main:                                  ## @main
Lfunc_begin0:
	.cfi_startproc
	.cfi_personality 155, ___gxx_personality_v0
	.cfi_lsda 16, Lexception0
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	subq	$16, %rsp
	leaq	-8(%rbp), %rdi
	callq	__ZN1AILj5EvEC2Ev
Ltmp0:
	callq	_destruct
Ltmp1:
## %bb.1:
Ltmp3:
	callq	_destruct
Ltmp4:
## %bb.2:
Ltmp6:
	callq	_destruct
Ltmp7:
## %bb.3:
	movl	$8, %eax
	addq	$16, %rsp
	popq	%rbp
	retq
LBB0_6:
Ltmp8:
	movq	%rax, %rdi
	callq	___clang_call_terminate
LBB0_5:
Ltmp5:
	movq	%rax, %rdi
	callq	___clang_call_terminate
LBB0_4:
Ltmp2:
	movq	%rax, %rdi
	callq	___clang_call_terminate
Lfunc_end0:
	.cfi_endproc
	.section	__TEXT,__gcc_except_tab
	.p2align	2
GCC_except_table0:
Lexception0:
	.byte	255                             ## @LPStart Encoding = omit
	.byte	155                             ## @TType Encoding = indirect pcrel sdata4
	.uleb128 Lttbase0-Lttbaseref0
Lttbaseref0:
	.byte	1                               ## Call site Encoding = uleb128
	.uleb128 Lcst_end0-Lcst_begin0
Lcst_begin0:
	.uleb128 Lfunc_begin0-Lfunc_begin0      ## >> Call Site 1 <<
	.uleb128 Ltmp0-Lfunc_begin0             ##   Call between Lfunc_begin0 and Ltmp0
	.byte	0                               ##     has no landing pad
	.byte	0                               ##   On action: cleanup
	.uleb128 Ltmp0-Lfunc_begin0             ## >> Call Site 2 <<
	.uleb128 Ltmp1-Ltmp0                    ##   Call between Ltmp0 and Ltmp1
	.uleb128 Ltmp2-Lfunc_begin0             ##     jumps to Ltmp2
	.byte	1                               ##   On action: 1
	.uleb128 Ltmp3-Lfunc_begin0             ## >> Call Site 3 <<
	.uleb128 Ltmp4-Ltmp3                    ##   Call between Ltmp3 and Ltmp4
	.uleb128 Ltmp5-Lfunc_begin0             ##     jumps to Ltmp5
	.byte	1                               ##   On action: 1
	.uleb128 Ltmp6-Lfunc_begin0             ## >> Call Site 4 <<
	.uleb128 Ltmp7-Ltmp6                    ##   Call between Ltmp6 and Ltmp7
	.uleb128 Ltmp8-Lfunc_begin0             ##     jumps to Ltmp8
	.byte	1                               ##   On action: 1
Lcst_end0:
	.byte	1                               ## >> Action Record 1 <<
                                        ##   Catch TypeInfo 1
	.byte	0                               ##   No further actions
	.p2align	2
                                        ## >> Catch TypeInfos <<
	.long	0                               ## TypeInfo 1
Lttbase0:
	.p2align	2
                                        ## -- End function
	.section	__TEXT,__text,regular,pure_instructions
	.globl	__ZN1AILj5EvEC2Ev               ## -- Begin function _ZN1AILj5EvEC2Ev
	.weak_def_can_be_hidden	__ZN1AILj5EvEC2Ev
	.p2align	4, 0x90

The really tricky part is that the nice sequence of calls we had before can no longer be inlined: the compiler must be able to isolate where an exception was thrown, if that happens, and to do so precisely enough to figure out exactly which partially-constructed objects need to be cleaned up.

So this time, the de-optimization is not just a matter of limited resources (after all, the code is pretty small in both cases), but of correctness. The compiler has to generate less efficient code because it now needs to distinguish where an exception was thrown, in order to correctly destroy potentially partially-constructed objects. As a result, it ends up generating a much more complicated call sequence in the main evaluation path.

Conclusion

While C++ compilers have become really good at optimizing code even in the presence of exceptions, the conclusions I reached in section 3.8 of the article I wrote in 1999 remain valid even today.

In short, you can very often consider that exceptions have no cost in C++, but it is better to verify that hypothesis by actually measuring the performance.
