Character deletion: AVX-512 edition
March 27, 2026
My new ZBook G1a comes with AVX-512 instructions and a healthy memory bandwidth (200GB/s allegedly), so one of the first things I did was write a little kernel to see how fast it can really go. This was my first time using AVX-512 instructions, and so I also wanted to see what new things it can do.
A toy SIMD problem I’ve played with before (see here, here and here) involves deleting characters from
a string. In those previous posts, we got around a 4x speedup using
SSE instructions and the specialised bit operation pext.
Let’s see how much better we can do with AVX-512.
What’s new?
AVX-512 comes with a tonne of new instructions. A couple are of interest here. First, there are new masked instructions (and a set of mask registers), which restrict operations to selected SIMD lanes. This is useful because loads and stores can be masked too; in particular, a mask can constrain a load/store which would otherwise cause a protection fault. This simplifies handling of data which doesn't chunk perfectly into 512-bit blocks.
Secondly, there are new compress instructions (e.g. vpcompressb)
which also make use of the new mask registers. These instructions take
lanes with set mask bits and pack them contiguously. In particular we
use a store variant of this instruction which will pack lanes
contiguously back into memory. This replaces pext followed by a
store, and is a bit more like the pbext shuffling utility we developed
before, but without the shuffle mask lookup table.
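As a mental model (not the intrinsic itself, which needs AVX-512 hardware), the compress-store can be sketched in portable C: bytes whose mask bit is set are packed contiguously into the output, in order, and the number written is the popcount of the mask. The function name here is made up for illustration.

```c
#include <stdint.h>

/* Portable model of a masked compress-store: copy the bytes of src
 * whose corresponding mask bit is set, packed contiguously into dst.
 * Returns the number of bytes written (the popcount of the mask). */
static int
compress_store_model(char *dst, uint64_t mask, const char src[64])
{
    int n = 0;
    for (int i = 0; i < 64; i++)
        if (mask & ((uint64_t)1 << i))
            dst[n++] = src[i];
    return n;
}
```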
We’ll visit these in reverse order: first writing an AVX-512 character deletion kernel which only operates on 64-byte blocks, and then modifying it so it can process a string of arbitrary length.
64-byte chunk version
This is about as straightforward as we might hope for:
/* Requires AVX-512BW and AVX-512VBMI2; compile with e.g. -march=native. */
#include <immintrin.h>

static int
delchar_avx512(char c, char str[64], char out[64])
{
    __m512i splat = _mm512_set1_epi8(c);
    __m512i data = _mm512_loadu_epi8((void *)str);
    __mmask64 nomatch = _mm512_cmpneq_epi8_mask(data, splat);
    _mm512_mask_compressstoreu_epi8(out, nomatch, data);
    return _mm_popcnt_u64((uint64_t)nomatch);
}
Line by line, this fills a register with the character c in each
byte lane, loads 64 bytes of data from the string str, computes a
mask of which lanes don't match the character c (lanewise, against
splat), and then uses the new compress-store instruction to pack the
non-matching bytes into the output buffer. Finally, we return the
length of what has been stored, which is simply the count of set bits
in nomatch.
This is much simpler than the variants we had before!
General version
The above, however, can be generalized. AVX-512 offers masked load instructions which let us limit the amount of data we load according to the length of the string. This means we can write a kernel which operates on a string of any length by computing, at each step, a mask covering the maximal (up to 64-byte) chunk still remaining.
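The chunk mask itself is just a run of low set bits. A tiny helper (hypothetical name, pulled out for illustration) makes the `(uint64_t)-1 >> (64 - next)` trick concrete: shifting all-ones right by `64 - next` leaves exactly `next` low bits set.

```c
#include <stdint.h>

/* Mask with the low `next` bits set, valid for 1 <= next <= 64.
 * (Shifting by a full 64 would be undefined behaviour, but the
 * kernel's loop guarantees next >= 1.) */
static uint64_t
chunk_mask(int next)
{
    return (uint64_t)-1 >> (64 - next);
}
```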
The following is a bit more complicated, but not much:
static int
delchar_avx512_masked(char c, int len, char str[len], char out[len])
{
    __m512i splat = _mm512_set1_epi8(c);
    char *co = out;
    while (len > 0)
    {
        int next = len > 64 ? 64 : len;
        __mmask64 mask = (uint64_t)-1 >> (64 - next);
        __m512i data = _mm512_mask_loadu_epi8(splat, mask, str);
        __mmask64 nomatch = _mm512_cmpneq_epi8_mask(data, splat);
        int count = _mm_popcnt_u64((uint64_t)nomatch);
        _mm512_mask_compressstoreu_epi8(co, nomatch, data);
        len -= next;
        str += next;
        co += count;
    }
    return (co - out);
}
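Since the intrinsics only run on AVX-512 hardware, it's useful to keep a portable scalar version with the same contract around for testing the kernel against. A minimal sketch (hypothetical name; essentially what a reference implementation looks like):

```c
/* Scalar reference: copy every byte of str except c into out,
 * returning the number of bytes written. Same contract as the
 * AVX-512 kernels above. */
static int
delchar_scalar(char c, int len, const char *str, char *out)
{
    int n = 0;
    for (int i = 0; i < len; i++)
        if (str[i] != c)
            out[n++] = str[i];
    return n;
}
```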
The mask mask is computed according to the remaining bytes in the
string. The masked load warrants a little explanation: the first
parameter determines what ends up in the lanes whose mask bit is
zero. Here we use splat, so the masked-out lanes are filled with the
character being deleted, and these are then removed automatically by
the nomatch mask computed in the next step. An alternative is to use
an arbitrary value (e.g. a zeroed register) and then bitwise-and
nomatch with the load mask before computing count and doing the
compress-store, but this costs one more instruction; filling the
masked lanes with the character to be deleted has the same effect.
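The merge semantics of the masked load can again be modelled in portable C (hypothetical function name): lanes with a clear mask bit take the pass-through value rather than touching memory, which is exactly what makes the tail of the string safe to process.

```c
#include <stdint.h>

/* Model of a merge-masked load: lane i comes from mem if bit i of
 * mask is set, otherwise from the pass-through value `fill`.
 * Memory beyond the set bits of the mask is never read. */
static void
mask_load_model(char dst[64], char fill, uint64_t mask, const char *mem)
{
    for (int i = 0; i < 64; i++)
        dst[i] = (mask & ((uint64_t)1 << i)) ? mem[i] : fill;
}
```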
The remainder is as in the 64-byte chunk variant, and the core here is wrapped in a loop to process an entire string.
It’s worth mentioning that AVX-512 also comes with 128-bit and 256-bit
variants of masked load/store and compress instructions, so we can
write versions using these to compare with our previous pext or
pbext variants.
Timings
This is a new machine with a Zen 5 AMD RYZEN AI MAX+ PRO 395, running Debian 13 Trixie with gcc 14.2.0. We redo all the measurements of the previous kernels for comparison.
| Method | Time | Throughput |
|---|---|---|
| REFERENCE | 4.96 seconds | 3.21995 GB/s |
| PEX SWAR | 1.93 seconds | 8.28743 GB/s |
| PEX MMX | 1.31 seconds | 12.1869 GB/s |
| PEX SSE | 1.31 seconds | 12.1861 GB/s |
| PBEXT8 | 1.72 seconds | 9.2862 GB/s |
| PBEXT16 | 0.946 seconds | 16.9636 GB/s |
| PEX SSE16 | 0.861 seconds | 18.5166 GB/s |
| CHECKMATCH | 1.91 seconds | 8.34062 GB/s |
| - | - | - |
| AVX-512 | 0.16 seconds | 95.7304 GB/s |
| AVX-512 MASK | 0.16 seconds | 95.5635 GB/s |
| AVX-512 MASK 128 | 0.64 seconds | 24.6832 GB/s |
| AVX-512 MASK 256 | 0.32 seconds | 49.1825 GB/s |
The first thing to notice is that the pre-AVX-512 kernels run around twice as fast as they did on the old Intel CPUs we used as a baseline. Their relative performance remains about the same.
The new AVX-512 variants, however, are radically faster (at least 5x
the previous best). AVX-512 is the first version we presented, without
the masking (but inlined at the call site); AVX-512 MASK is the masked
variant we presented second, while the versions suffixed 128 and 256
are the same but specialised to 128-bit (SSE) and 256-bit (AVX)
registers respectively.
The masked variant performs the same as the unmasked variant – the masked instructions make it much easier to deal with chunks that don’t perfectly match register size, so it’s great to know the performance holds up here too.
Secondly, even the restricted 128-bit and 256-bit masked variants provide a decent (in the 128-bit case) and significant (in the 256-bit case) speedup vs. the pre-AVX-512 kernels. (The AVX2 PEXT version I originally wrote isn't in the timings: it was slower than the SSE version, and remains so on the Zen 5 core.) Indeed, the throughput of the AVX-512 masked kernels scales roughly in proportion to register size.
Comparing our original portable reference version with the full-width AVX-512 kernels, we see that AVX-512 gives us around a 30x increase in throughput (95.7 vs. 3.2 GB/s). That's a lot.