State of C++ Static Analysis circa 2020

Take the following code:

#include <string_view>

int * f1()
{
    int x = 5;
    return &x;
}

struct V
{
    int * p;
};

V f2()
{
    int x = 5;
    return { &x };
}

std::string_view f3()
{
    char tmp[] = "tmp";
    return tmp;
}

All three functions obviously return dangling pointers to local stack variables. Let’s see what a few major compilers have to say on the matter.

g++ 10.1 -O2 -std=c++2a -fanalyzer -Wall -Wextra (link):

f1():
        xor     eax, eax
        ret
f2():
        lea     rax, [rsp-4]
        ret
f3():
        mov     eax, 3
        lea     rdx, [rsp-4]
        ret
<source>: In function 'int* f1()':
<source>:6:12: warning: address of local variable 'x' returned [-Wreturn-local-addr]
    6 |     return &x;
      |            ^~

In addition to the warning in f1, it even zapped the pointer to nullptr. An interesting choice, with which not everyone agrees, but in my opinion returning a null pointer is much better than returning a dangling pointer to just-deallocated stack memory… which is exactly what happens in f2 and f3.

Let’s try Microsoft cl.exe 19.24 /O2 /std:c++latest /W4 /analyze (link):

<source>(6) : warning C4172: returning address of local variable or temporary: x
<source>(17) : warning C4172: returning address of local variable or temporary: x

That’s better, but not better enough. std::string_view is a rather important type, and a potential rich source of lifetime mistakes.

Maybe Intel icc 19.0.1 -O2 -std=c++17 -Wall -Wextra (link) will fare better?

<source>(6): warning #1251: returning pointer to local variable
      return &x;
             ^

Sadly, not really. clang++ 10.0.0 -O2 -std=c++2a -Wall -Wextra (link) is our last hope.

<source>:6:13: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
    return &x;
            ^
<source>:17:15: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
    return { &x };
              ^

Good but not still good enough.

Everything is lost, then? We’ll never have compilers that catch obvious lifetime mistakes?

Maybe not. Let’s try our real last hope, the experimental -Wlifetime build of clang (link):

<source>:6:13: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
    return &x;
            ^
<source>:6:5: warning: returning a dangling pointer [-Wlifetime]
    return &x;
    ^~~~~~~~~
<source>:6:5: note: pointee 'x' left the scope here
    return &x;
    ^~~~~~~~~
<source>:17:15: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
    return { &x };
              ^
<source>:23:12: warning: address of stack memory associated with local variable 'tmp' returned [-Wreturn-stack-address]
    return tmp;
           ^~~
<source>:23:5: warning: returning a dangling pointer [-Wlifetime]
    return tmp;
    ^~~~~~~~~~
<source>:23:5: note: pointee 'tmp' left the scope here
    return tmp;
    ^~~~~~~~~~

Interesting. Not only did -Wlifetime catch f1 and f3 (but not f2 for some reason!), the normal -Wreturn-stack-address warning caught f3 this time as well, in addition to f1 and f2.

(Herb Sutter has an interesting post about the experimental -Wlifetime compiler. It can’t arrive soon enough if you ask me.)

C++20 Ranges to the Rescue

Vittorio Romeo documents his collision with gamedev Twitter, and in the process shows the following snippet:

template <typename... Images>
TextureAtlas stitchImages(const Images&... images)
{
    const auto width = (images.width + ...);
    const auto height = std::max({images.height...});

    // ...

His point, and he certainly has one, is that pack expansion with fold expressions leads to much cleaner code than the equivalent that would operate on a runtime container of images:

TextureAtlas stitchImages(const std::vector<Image>& images)
{
    std::size_t width = 0;
    for(const auto& img : images)
    {
        width += img.width;
    }

    std::size_t maxHeight = 0;
    for(const auto& img : images)
    {
        maxHeight = std::max(maxHeight, img.height);
    }

    // ...

He then preempts the obvious “use the standard algorithms” retort with

TextureAtlas stitchImages(const std::vector<Image>& images)
{
    const auto width = std::accumulate(images.begin(), images.end(), 0
        [](const std::size_t acc, const Image& img)
        {
            return acc + img.width;
        });

    const auto height = std::max_element(images.begin(), images.end(),
        [](const Image& imgA, const Image& imgB)
        {
            return imgA.height < imgB.height;
        })->height;

    // ...

A-ha, we say to ourselves. He has missed our non-obvious retort, which, in keeping with his “modern C++” theme, is “just use the C++20 range algorithms, which take projections.”

TextureAtlas stitchImages(const std::vector<Image>& images)
{
    const auto width = std::ranges::accumulate( images, 0, {},
        &Image::width );

    const auto height = std::ranges::max( images, {},
        &Image::height );

    // ...

Modern C++ wins again!

Or maybe not. The above certainly feels like it ought to work, but it has two problems. First, when I wrote the above, I certainly expected const auto height to be the maximum image height, but it isn’t. Instead, as written, it actually holds the image with the maximum height, because max returns a reference to the actual element, not the maximum value after the projection.

That’s easily fixed:

    const auto height = std::ranges::max( images, {},
        &Image::height ).height;

but allows me to offer some entirely unsolicited and unwarranted advice. You know those people who tell you to “almost always use auto”? Don’t listen to them.

The second problem with the above snippet is that… std::ranges::accumulate doesn’t exist.

The reason it doesn’t exist is that std::accumulate is defined in <numeric>. While the algorithms in the appropriately named <algorithm> have ranges:: versions in C++20, those in <numeric> do not.

They do not because C++20 got two major features, Ranges and Concepts, at the same time, which meant that the standard library algorithms got to be modernized twice, once to acquire a range-aware version, placed in std::ranges::, and once to be properly constrained via concepts. This, in turn, meant that those concepts needed to be designed.

There wasn’t enough time to properly design the numeric concepts required for the algorithms in <numeric> to be brought up to date, although work on that is underway.

None of which, sadly, helps the C++20 user. (Don’t tell any of this to any game developers.)

It's a Small World

I came across this article, in which the author (a fan of Rust by the look of it) takes apart outrageous claims by Antony Polukhin that Rust is not highly superior to C++.

Antony is a Boost author and maintainer, and active in the C++ standard committee, so I obviously decided to watch his video (in Russian).

At one point he said something to the effect of “rigged language benchmarks often compare their built-in regex performance with std::regex, and this is unfair, because if your regular expression is known and fixed at compile time, you should use Hana’s CTRE.”

Hana is Hana Dusíková, and CTRE is her compile-time regular expression library.

So I decided to look at CTRE. It’s a pretty interesting library, and the most interesting part of it is the resuling assembly code. The regular expression [0-9]+, for instance, when passed to ctre::search, generates

        lea     rdx, [rcx + rax]
.LBB0_35:
        mov     rdi, rdx
        sub     rdi, rcx
        xor     esi, esi
.LBB0_36:
        movzx   ebx, byte ptr [rcx + rsi]
        add     bl, -48
        cmp     bl, 10
        jae     .LBB0_37
        add     rsi, 1
        cmp     rax, rsi
        jne     .LBB0_36
        mov     rsi, rdi
.LBB0_37:
        test    rsi, rsi
        jne     .LBB0_38
        add     rcx, 1
        add     rax, -1
        cmp     rcx, rdx
        jne     .LBB0_35
.LBB0_27:
        xor     ebp, ebp
        jmp     .LBB0_28
.LBB0_38:
        mov     bpl, 1
.LBB0_28:

which may not be the exact same code you or I would have written, but it’s fairly readable and does the same basic loop and comparisons as a hand-written matcher would.

This is incidentally many times faster than std::regex on the same task. (And still faster, but significantly fewer times, than Boost.Regex, on which the specification of std::regex is based.)

Regular expressions turned out to be a pretty interesting area. I was soon reading Russ Cox’s series of articles (1, 2, 3, 4.) At part 3, it turned out that he’s the author of Google’s RE2 regex library.

Another browser tab had in it open an article describing how and why a search tool called ripgrep and written in Rust was faster than other competing grep tools, and it, too, proved engaging. I learned that Rust’s regex library uses a SIMD algorithm called “Teddy”, invented by Geoffrey Langdale for the Hyperscan regex library. Hyperscan was apparently a startup that was specializing in scanning streaming data (such as network packets) against many regular expressions, and doing so very quickly thanks to insane SIMD optimizations. (It’s been since acquired by Intel, and the library open-sourced.)

I started reading posts by Geoff Langdale, first on the Hyperscan blog, then on his own. Clever SIMD tricks applied to regex scanning were certainly present, but also was a mention of simdjson. So that was where all that SIMD in simdjson came from!

I had heard of this library because of my involvement with proposed-for-Boost.JSON by Vinnie Falco, a new JSON library that also takes performance seriously (among other things.)

It’s a small world.

Transforming a Variant

Q: How do I transform a variant<X, Y, Z> V into variant<vector<X>, vector<Y>, vector<Z>>?

A: Use mp_transform<std::vector, V>.

Q: And what if I want to use my own allocator A, that is, want variant<vector<X, A>, vector<Y, A>, vector<Z, A>>?

A: Use mp_transform_q<mp_bind_back<std::vector, A>, V>.

Q: But my allocator is a template. How about variant<vector<X, A<X>>, vector<Y, A<Y>>, vector<Z, A<Z>>>?

A: Use mp_transform<std::vector, V, mp_transform<A, V>>.

Q: Does Mp11 answer every question with a cryptic one liner?

A: Yes.

Variants of 680 Types

Richard Hodges asked me whether Variant2 supports 680 types.

You might be tempted to respond to that with “why would any sane person need a variant of 680 types??”, but we here at Variant2 Enterprises, LLC ask our customers no such questions, particularly not using more than one question mark in the process.

The answer turned out to be a qualified “yes”, the qualification being a fast machine with 128GB RAM.

I had never tried putting as many types into variant2::variant, just kind of assumed that it would of course work. It turned out to hit two problems.

First, due to the way std::variant is specified to support constexpr operations, a specification which variant2 follows, the necessary underlying structure is a recursive union, something like

template<class T1, class... T> union storage
{
    T1 first_;
    storage<T...> rest_;
};

This meant that 680 types caused 680 template instantiations of this storage type. Template instantiations are expensive, both in time and in space. Compiler Explorer’s time limit (and memory quota) were being exceeded for 200 types, let alone 680.

I added a specialization that cut down the storage instantiations by a factor of 10, which enabled the above example to compile on CE.

The second problem was the implementation of mp11::mp_with_index, which powers variant2::visit. It generates a switch of however many alternatives are requested (680 in our case), but did so 16 alternatives at a time, which meant 40+ switch statements. This, again for 200 types, often exceeded the ability of CE’s version of g++.

I changed mp_with_index to divide the alternatives in half, and it conquered.

So now you no longer need 128GB RAM to use variants of 680 types, although they still don’t set any compilation speed records.