It's a Small World
15 May 2020I came across this article, in which the author (a fan of Rust by the look of it) takes apart outrageous claims by Antony Polukhin that Rust is not highly superior to C++.
Antony is a Boost author and maintainer, and active in the C++ standard committee, so I obviously decided to watch his video (in Russian).
At one point he said something to the effect of “rigged language
benchmarks often compare their built-in regex performance with
std::regex
, and this is unfair, because if your regular expression
is known and fixed at compile time, you should use Hana’s CTRE.”
Hana is Hana Dusíková, and CTRE is her compile-time regular expression library.
So I decided to look at CTRE. It’s a pretty interesting library, and
the most interesting part of it is the resuling assembly code. The
regular expression [0-9]+
, for instance, when passed to ctre::search
,
generates
lea rdx, [rcx + rax]
.LBB0_35:
mov rdi, rdx
sub rdi, rcx
xor esi, esi
.LBB0_36:
movzx ebx, byte ptr [rcx + rsi]
add bl, -48
cmp bl, 10
jae .LBB0_37
add rsi, 1
cmp rax, rsi
jne .LBB0_36
mov rsi, rdi
.LBB0_37:
test rsi, rsi
jne .LBB0_38
add rcx, 1
add rax, -1
cmp rcx, rdx
jne .LBB0_35
.LBB0_27:
xor ebp, ebp
jmp .LBB0_28
.LBB0_38:
mov bpl, 1
.LBB0_28:
which may not be the exact same code you or I would have written, but it’s fairly readable and does the same basic loop and comparisons as a hand-written matcher would.
This is incidentally many times faster than std::regex
on the
same task. (And still faster, but significantly fewer times, than
Boost.Regex, on which the
specification of std::regex
is based.)
Regular expressions turned out to be a pretty interesting area. I was soon reading Russ Cox’s series of articles (1, 2, 3, 4.) At part 3, it turned out that he’s the author of Google’s RE2 regex library.
Another browser tab had in it open an article
describing how and why a search tool called ripgrep
and written in Rust
was faster than other competing grep
tools, and it, too, proved engaging.
I learned that Rust’s regex library uses a
SIMD algorithm
called “Teddy”, invented by Geoffrey Langdale for the
Hyperscan regex library. Hyperscan was
apparently a startup that was specializing in scanning streaming
data (such as network packets) against many regular expressions,
and doing so very quickly thanks to insane SIMD optimizations. (It’s
been since acquired by Intel, and the library open-sourced.)
I started reading posts by Geoff Langdale, first on the
Hyperscan blog,
then on his own. Clever SIMD tricks applied
to regex scanning were certainly present, but also was a mention
of simdjson. So that was where all that SIMD
in simdjson
came from!
I had heard of this library because of my involvement with proposed-for-Boost.JSON by Vinnie Falco, a new JSON library that also takes performance seriously (among other things.)
It’s a small world.