17 May 2020
Take the following code:
#include <string_view>

int * f1()
{
    int x = 5;
    return &x;
}

struct V
{
    int * p;
};

V f2()
{
    int x = 5;
    return { &x };
}

std::string_view f3()
{
    char tmp[] = "tmp";
    return tmp;
}
All three functions obviously return dangling pointers to local
stack variables. Let’s see what a few major compilers have to say
on the matter.
g++ 10.1 -O2 -std=c++2a -fanalyzer -Wall -Wextra
(link):
f1():
xor eax, eax
ret
f2():
lea rax, [rsp-4]
ret
f3():
mov eax, 3
lea rdx, [rsp-4]
ret
<source>: In function 'int* f1()':
<source>:6:12: warning: address of local variable 'x' returned [-Wreturn-local-addr]
6 | return &x;
| ^~
In addition to the warning in f1, it even zapped the pointer to nullptr. An interesting choice, with which not everyone agrees, but in my opinion returning a null pointer is much better than returning a dangling pointer to just-deallocated stack memory… which is exactly what happens in f2 and f3.
Let’s try Microsoft cl.exe 19.24 /O2 /std:c++latest /W4 /analyze
(link):
<source>(6) : warning C4172: returning address of local variable or temporary: x
<source>(17) : warning C4172: returning address of local variable or temporary: x
That’s better, but not better enough. std::string_view is a rather important type, and a potentially rich source of lifetime mistakes.
Maybe Intel icc 19.0.1 -O2 -std=c++17 -Wall -Wextra
(link) will fare better?
<source>(6): warning #1251: returning pointer to local variable
return &x;
^
Sadly, not really. clang++ 10.0.0 -O2 -std=c++2a -Wall -Wextra
(link) is our last hope.
<source>:6:13: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
return &x;
^
<source>:17:15: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
return { &x };
^
Good, but still not good enough.
Everything is lost, then? We’ll never have compilers that catch obvious lifetime mistakes?
Maybe not. Let’s try our real last hope, the experimental -Wlifetime
build of clang
(link):
<source>:6:13: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
return &x;
^
<source>:6:5: warning: returning a dangling pointer [-Wlifetime]
return &x;
^~~~~~~~~
<source>:6:5: note: pointee 'x' left the scope here
return &x;
^~~~~~~~~
<source>:17:15: warning: address of stack memory associated with local variable 'x' returned [-Wreturn-stack-address]
return { &x };
^
<source>:23:12: warning: address of stack memory associated with local variable 'tmp' returned [-Wreturn-stack-address]
return tmp;
^~~
<source>:23:5: warning: returning a dangling pointer [-Wlifetime]
return tmp;
^~~~~~~~~~
<source>:23:5: note: pointee 'tmp' left the scope here
return tmp;
^~~~~~~~~~
Interesting. Not only did -Wlifetime catch f1 and f3 (but not f2 for some reason!), the normal -Wreturn-stack-address warning caught f3 this time as well, in addition to f1 and f2.
(Herb Sutter has an interesting post about the experimental -Wlifetime compiler. It can’t arrive soon enough if you ask me.)
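In the meantime, the usual way to sidestep the f3 mistake is to return an owning type instead of a view into a local buffer. A minimal sketch (the function name is mine):

```cpp
#include <string>

// Returning an owning std::string instead of a std::string_view avoids
// the dangling view: the caller receives its own copy of the data.
std::string f3_safe()
{
    char tmp[] = "tmp";
    return std::string(tmp); // copies the local buffer before it goes away
}
```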
16 May 2020
Vittorio Romeo documents
his collision with gamedev Twitter, and in the process shows the following
snippet:
template <typename... Images>
TextureAtlas stitchImages(const Images&... images)
{
    const auto width = (images.width + ...);
    const auto height = std::max({images.height...});

    // ...
His point, and he certainly has one, is that pack expansion with fold
expressions leads to much cleaner code than the equivalent that would
operate on a runtime container of images:
TextureAtlas stitchImages(const std::vector<Image>& images)
{
    std::size_t width = 0;

    for(const auto& img : images)
    {
        width += img.width;
    }

    std::size_t maxHeight = 0;

    for(const auto& img : images)
    {
        maxHeight = std::max(maxHeight, img.height);
    }

    // ...
He then preempts the obvious “use the standard algorithms” retort with
TextureAtlas stitchImages(const std::vector<Image>& images)
{
    const auto width = std::accumulate(images.begin(), images.end(), 0,
        [](const std::size_t acc, const Image& img)
        {
            return acc + img.width;
        });

    const auto height = std::max_element(images.begin(), images.end(),
        [](const Image& imgA, const Image& imgB)
        {
            return imgA.height < imgB.height;
        })->height;

    // ...
A-ha, we say to ourselves. He has missed our non-obvious retort, which,
in keeping with his “modern C++” theme, is “just use the C++20 range
algorithms, which take projections.”
TextureAtlas stitchImages(const std::vector<Image>& images)
{
    const auto width = std::ranges::accumulate( images, 0, {},
        &Image::width );

    const auto height = std::ranges::max( images, {},
        &Image::height );

    // ...
Modern C++ wins again!
Or maybe not. The above certainly feels like it ought to work, but it has two problems. First, when I wrote the above, I certainly expected const auto height to be the maximum image height, but it isn’t. Instead, as written, it actually holds the image with the maximum height, because max returns the actual element, not the maximum value after the projection.
That’s easily fixed:
const auto height = std::ranges::max( images, {},
&Image::height ).height;
but allows me to offer some entirely unsolicited and unwarranted advice. You know those people who tell you to “almost always use auto”? Don’t listen to them.
The second problem with the above snippet is that… std::ranges::accumulate
doesn’t exist.
The reason it doesn’t exist is that std::accumulate
is defined in <numeric>
.
While the algorithms in the appropriately named <algorithm>
have ranges::
versions in C++20, those in <numeric>
do not.
They do not because C++20 got two major features, Ranges and Concepts, at the
same time, which meant that the standard library algorithms got to be
modernized twice, once to acquire a range-aware version, placed in
std::ranges::
, and once to be properly constrained via concepts. This, in
turn, meant that those concepts needed to be designed.
There wasn’t enough time to properly design the numeric concepts required for
the algorithms in <numeric>
to be brought up to date, although work on that
is underway.
None of which, sadly, helps the C++20 user. (Don’t tell any of this to any game
developers.)
15 May 2020
I came across this article,
in which the author (a fan of Rust by the look of it) takes apart
outrageous claims by Antony Polukhin
that Rust is not highly superior to C++.
Antony is a
Boost author and maintainer,
and active in the C++ standard committee, so I obviously decided to watch
his video (in Russian).
At one point he said something to the effect of “rigged language
benchmarks often compare their built-in regex performance with
std::regex
, and this is unfair, because if your regular expression
is known and fixed at compile time, you should use Hana’s CTRE.”
Hana is Hana Dusíková, and CTRE
is her compile-time regular expression library.
So I decided to look at CTRE. It’s a pretty interesting library, and the most interesting part of it is the resulting assembly code. The regular expression [0-9]+, for instance, when passed to ctre::search, generates
lea rdx, [rcx + rax]
.LBB0_35:
mov rdi, rdx
sub rdi, rcx
xor esi, esi
.LBB0_36:
movzx ebx, byte ptr [rcx + rsi]
add bl, -48
cmp bl, 10
jae .LBB0_37
add rsi, 1
cmp rax, rsi
jne .LBB0_36
mov rsi, rdi
.LBB0_37:
test rsi, rsi
jne .LBB0_38
add rcx, 1
add rax, -1
cmp rcx, rdx
jne .LBB0_35
.LBB0_27:
xor ebp, ebp
jmp .LBB0_28
.LBB0_38:
mov bpl, 1
.LBB0_28:
which may not be the exact same code you or I would have written,
but it’s fairly readable and does the same basic loop and
comparisons as a hand-written matcher would.
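A hand-written equivalent of that inner loop might look like the following sketch (the function name is mine; the `add bl, -48; cmp bl, 10` pair in the assembly is the classic single-comparison unsigned range check for ASCII digits):

```cpp
#include <string_view>

// Scan for a run of [0-9]+ using the same digit test the compiler
// emitted: (unsigned)(c - '0') < 10 checks for '0'..'9' in one compare.
bool has_digit_run(std::string_view s)
{
    for (char c : s)
    {
        if (static_cast<unsigned char>(c - '0') < 10)
            return true; // at least one digit found, so [0-9]+ matches
    }

    return false;
}
```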
This is incidentally many times faster than std::regex on the same task. (It’s also faster than Boost.Regex, on which the specification of std::regex is based, although by a significantly smaller factor.)
Regular expressions turned out to be a pretty interesting area. I was soon reading Russ Cox’s series of articles (1, 2, 3, 4). By part 3, it turned out that he’s the author of Google’s RE2 regex library.
Another browser tab had open an article describing how and why a search tool called ripgrep, written in Rust, was faster than other competing grep tools, and it, too, proved engaging.
I learned that Rust’s regex library uses a SIMD algorithm called “Teddy”, invented by Geoffrey Langdale for the Hyperscan regex library. Hyperscan was apparently a startup specializing in scanning streaming data (such as network packets) against many regular expressions, and doing so very quickly thanks to insane SIMD optimizations. (It has since been acquired by Intel, and the library open-sourced.)
I started reading posts by Geoff Langdale, first on the Hyperscan blog, then on his own. Clever SIMD tricks applied to regex scanning were certainly present, but so was a mention of simdjson. So that was where all that SIMD in simdjson came from!
I had heard of this library because of my involvement with the proposed-for-Boost Boost.JSON by Vinnie Falco, a new JSON library that also takes performance seriously (among other things).
It’s a small world.
14 May 2020
Q: How do I transform a variant<X, Y, Z> V into variant<vector<X>, vector<Y>, vector<Z>>?
A: Use mp_transform<std::vector, V>.
Q: And what if I want to use my own allocator A, that is, want variant<vector<X, A>, vector<Y, A>, vector<Z, A>>?
A: Use mp_transform_q<mp_bind_back<std::vector, A>, V>.
Q: But my allocator is a template. How about variant<vector<X, A<X>>, vector<Y, A<Y>>, vector<Z, A<Z>>>?
A: Use mp_transform<std::vector, V, mp_transform<A, V>>.
Q: Does Mp11 answer every question with a cryptic one-liner?
A: Yes.
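For the curious, the first one-liner is less magic than it looks. Here is a minimal, self-contained sketch of what mp_transform does (Mp11 itself is considerably more general):

```cpp
#include <type_traits>
#include <variant>
#include <vector>

// Minimal sketch of mp_transform: apply the template F to every element
// of the type list L, preserving the outer list (here, std::variant).
template<template<class...> class F, class L> struct mp_transform_impl;

template<template<class...> class F, template<class...> class L, class... T>
struct mp_transform_impl<F, L<T...>>
{
    using type = L<F<T>...>;
};

template<template<class...> class F, class L>
using mp_transform = typename mp_transform_impl<F, L>::type;

struct X {}; struct Y {}; struct Z {};

static_assert(std::is_same_v<
    mp_transform<std::vector, std::variant<X, Y, Z>>,
    std::variant<std::vector<X>, std::vector<Y>, std::vector<Z>>>);
```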
13 May 2020
Richard Hodges asked me whether
Variant2 supports 680 types.
You might be tempted to respond to that with “why would any sane person
need a variant of 680 types??”, but we here at Variant2 Enterprises, LLC
ask our customers no such questions, particularly not using more than one
question mark in the process.
The answer turned out to be a qualified “yes”, the qualification being a
fast machine with 128GB RAM.
I had never tried putting as many types into variant2::variant
, just kind
of assumed that it would of course work. It turned out to hit two problems.
First, due to the way std::variant
is specified to support constexpr
operations, a specification which variant2
follows, the necessary underlying
structure is a recursive union, something like
template<class T1, class... T> union storage
{
    T1 first_;
    storage<T...> rest_;
};
This meant that 680 types caused 680 template instantiations of this storage type. Template instantiations are expensive, both in time and in space. Compiler Explorer’s time limit (and memory quota) was being exceeded at 200 types, let alone 680.
I added a specialization
that cut down the storage instantiations by a factor of 10, which enabled
the above example to compile on CE.
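The idea behind the specialization, as a simplified sketch (this is not the actual variant2 code, which also has to provide constexpr-friendly construction and peels ten types at a time rather than four):

```cpp
// Simplified recursive storage, one type per level (trivial types only).
template<class T1, class... T> union storage
{
    T1 first_;
    storage<T...> rest_;
};

template<class T1> union storage<T1>
{
    T1 first_;
};

// The optimization: peel several alternatives per level, so N types cost
// roughly N/4 instantiations instead of N (variant2 peels ten at a time).
template<class T1, class T2, class T3, class T4, class T5, class... T>
union storage<T1, T2, T3, T4, T5, T...>
{
    T1 first_;
    T2 second_;
    T3 third_;
    T4 fourth_;
    storage<T5, T...> rest_;
};
```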
The second problem was the implementation of mp11::mp_with_index, which powers variant2::visit. It generates a switch over however many alternatives are requested (680 in our case), but did so 16 alternatives at a time, which meant 40+ switch statements. This, again at 200 types, often exceeded the capabilities of CE’s version of g++.
I changed mp_with_index to divide the alternatives in half, and it conquered.
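The divide-in-half idea, as a hypothetical sketch (names are mine; the real mp_with_index generates actual switch statements rather than if chains):

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Dispatch a runtime index i in [Lo, Hi) to a compile-time constant by
// recursively halving the range: O(log N) nesting depth instead of one
// long sequence of 16-way switches.
template<std::size_t Lo, std::size_t Hi, class F>
decltype(auto) with_index_impl(std::size_t i, F&& f)
{
    if constexpr (Lo + 1 == Hi)
    {
        return std::forward<F>(f)(std::integral_constant<std::size_t, Lo>{});
    }
    else
    {
        constexpr std::size_t Mid = Lo + (Hi - Lo) / 2;

        if (i < Mid)
            return with_index_impl<Lo, Mid>(i, std::forward<F>(f));
        else
            return with_index_impl<Mid, Hi>(i, std::forward<F>(f));
    }
}

// with_index<N>(i, f) calls f(integral_constant<size_t, i>{}), i < N
template<std::size_t N, class F>
decltype(auto) with_index(std::size_t i, F&& f)
{
    return with_index_impl<0, N>(i, std::forward<F>(f));
}
```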
So now you no longer need 128GB RAM to use variants of 680 types, although
they still don’t set any compilation speed records.