It's a Small World

15 May 2020

I came across this article, in which the author (a fan of Rust by the look of it) takes apart outrageous claims by Antony Polukhin that Rust is not highly superior to C++.

Antony is a Boost author and maintainer, and active in the C++ standard committee, so I obviously decided to watch his video (in Russian).

At one point he said something to the effect of “rigged language benchmarks often compare their built-in regex performance with std::regex, and this is unfair, because if your regular expression is known and fixed at compile time, you should use Hana’s CTRE.”

Hana is Hana Dusíková, and CTRE is her compile-time regular expression library.

So I decided to look at CTRE. It’s a pretty interesting library, and the most interesting part of it is the resuling assembly code. The regular expression [0-9]+, for instance, when passed to ctre::search, generates

        lea     rdx, [rcx + rax]
.LBB0_35:
        mov     rdi, rdx
        sub     rdi, rcx
        xor     esi, esi
.LBB0_36:
        movzx   ebx, byte ptr [rcx + rsi]
        add     bl, -48
        cmp     bl, 10
        jae     .LBB0_37
        add     rsi, 1
        cmp     rax, rsi
        jne     .LBB0_36
        mov     rsi, rdi
.LBB0_37:
        test    rsi, rsi
        jne     .LBB0_38
        add     rcx, 1
        add     rax, -1
        cmp     rcx, rdx
        jne     .LBB0_35
.LBB0_27:
        xor     ebp, ebp
        jmp     .LBB0_28
.LBB0_38:
        mov     bpl, 1
.LBB0_28:

which may not be the exact same code you or I would have written, but it’s fairly readable and does the same basic loop and comparisons as a hand-written matcher would.

This is incidentally many times faster than std::regex on the same task. (And still faster, but significantly fewer times, than Boost.Regex, on which the specification of std::regex is based.)

Regular expressions turned out to be a pretty interesting area. I was soon reading Russ Cox’s series of articles (1, 2, 3, 4.) At part 3, it turned out that he’s the author of Google’s RE2 regex library.

Another browser tab had in it open an article describing how and why a search tool called ripgrep and written in Rust was faster than other competing grep tools, and it, too, proved engaging. I learned that Rust’s regex library uses a SIMD algorithm called “Teddy”, invented by Geoffrey Langdale for the Hyperscan regex library. Hyperscan was apparently a startup that was specializing in scanning streaming data (such as network packets) against many regular expressions, and doing so very quickly thanks to insane SIMD optimizations. (It’s been since acquired by Intel, and the library open-sourced.)

I started reading posts by Geoff Langdale, first on the Hyperscan blog, then on his own. Clever SIMD tricks applied to regex scanning were certainly present, but also was a mention of simdjson. So that was where all that SIMD in simdjson came from!

I had heard of this library because of my involvement with proposed-for-Boost.JSON by Vinnie Falco, a new JSON library that also takes performance seriously (among other things.)

It’s a small world.

Peter Dimov

It's a Small World