Character deletion AVX-512 edition

March 27, 2026

My new ZBook G1a comes with AVX-512 instructions and a healthy memory bandwidth (200GB/s allegedly), so one of the first things I did was write a little kernel to see how fast it can really go. This was my first time using AVX-512 instructions, and so I also wanted to see what new things it can do.

A toy SIMD problem I’ve played with before (see here, here and here) involves deleting characters from a string. In those previous posts, we got around a 4x speedup using SSE instructions and the specialised bit operation pext.
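As a refresher, the scalar baseline looks something like the following. This is a sketch of the portable reference kernel the SIMD variants are measured against; the exact code in the earlier posts may differ in detail:

```c
#include <assert.h>
#include <stddef.h>

/* Portable scalar baseline: copy every byte of str except c into out,
 * returning the number of bytes written. */
static size_t
delchar_ref(char c, size_t len, const char *str, char *out)
{
	size_t n = 0;
	for (size_t i = 0; i < len; i++)
		if (str[i] != c)
			out[n++] = str[i];
	return n;
}
```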

Let’s see how much better we can do with AVX-512.

What’s new?

AVX-512 comes with a tonne of new instructions. A couple are of interest here. First, there are new masked instructions (and a set of mask registers), which let an operation be restricted to selected SIMD lanes. This is useful because loads and stores can be masked too: in particular, a mask can constrain a load or store which would otherwise cause a protection fault. This simplifies handling of data which doesn't chunk perfectly into 512-bit blocks.
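A 64-lane byte mask is just a 64-bit integer with one bit per lane. To cover only the first n bytes of a chunk, we want the low n bits set; a scalar sketch (the same trick the general kernel further down uses, with the hypothetical helper name tail_mask):

```c
#include <assert.h>
#include <stdint.h>

/* Build a 64-lane mask with the low n bits set, for 1 <= n <= 64.
 * Note we go through 64 - n rather than ((uint64_t)1 << n) - 1,
 * because shifting a 64-bit value by 64 is undefined in C and the
 * n == 64 case must still work. */
static uint64_t
tail_mask(int n)
{
	return (uint64_t)-1 >> (64 - n);
}
```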

Secondly, there are new compress instructions (e.g. vpcompressb), which also make use of the new mask registers. These take the lanes whose mask bits are set and pack them contiguously. In particular, we use a store variant which packs the selected lanes directly back into memory. This replaces pext followed by a store, and is a bit like the pbext shuffling utility we developed before, but without the shuffle-mask lookup table.
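The effect of a masked compress-store can be modelled in scalar code. This sketch (with the hypothetical name compress_store_model) packs the bytes whose mask bit is set, preserving order, which is exactly what the intrinsic below does in a single instruction:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked compress-store: for each set bit i in mask
 * (low bit = lane 0), append src[i] to dst. Returns bytes written. */
static size_t
compress_store_model(uint8_t *dst, uint64_t mask, const uint8_t *src)
{
	size_t n = 0;
	for (size_t i = 0; i < 64; i++)
		if (mask & ((uint64_t)1 << i))
			dst[n++] = src[i];
	return n;
}
```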

We’ll visit these in reverse: first writing an AVX-512 character-deletion kernel which only operates on 64-byte blocks, and then modifying it so it can process a string of arbitrary length.

64-byte chunk version

This is about as straightforward as we might hope for:

static int
delchar_avx512(char c, char str[64], char out[64])
{
	__m512i splat       = _mm512_set1_epi8(c);
	__m512i data        = _mm512_loadu_epi8((void *)str);
	__mmask64 nomatch   = _mm512_cmpneq_epi8_mask(data, splat);

	_mm512_mask_compressstoreu_epi8(out, nomatch, data);
	return _mm_popcnt_u64((uint64_t)nomatch);
}

Line by line, this fills a register with the character c in every byte lane, loads 64 bytes of data from the string str, computes a mask of which lanes don’t match the character c (lanewise, against splat), and then uses the new compress-store instruction to pack all of those lanes into the output buffer. Finally, we return the length of what has been stored, which is simply the count of set bits in nomatch.

This is much simpler than the variants we had before!

General version

The above can, however, be generalized. AVX-512 offers masked load instructions which let us limit the amount of data we load according to the length of the string. This means we can write a kernel which operates on a general string by computing, at each step, a mask covering the largest (up to 64-byte) chunk still remaining.

The following is a bit more complicated, but not much:

static int
delchar_avx512_masked(char c, int len, char str[len], char out[len])
{
	__m512i splat = _mm512_set1_epi8(c);
	char *co = out;

	while (len > 0)
	{
		int next            = len > 64 ? 64 : len;
		__mmask64 mask      = (uint64_t)-1 >> (64 - next);
		__m512i data        = _mm512_mask_loadu_epi8(splat, mask, str);
		__mmask64 nomatch   = _mm512_cmpneq_epi8_mask(data, splat);
		int count           = _mm_popcnt_u64((uint64_t)nomatch);

		_mm512_mask_compressstoreu_epi8(co, nomatch, data);

		len -= next;
		str += next;
		co += count;
	}

	return (co - out);
}

The load mask mask is computed according to the number of bytes remaining in the string. The masked load warrants a little explanation: the first parameter determines what ends up in the lanes whose mask bit is zero. Here we use splat to fill those lanes with the character c, because such lanes are then removed by the nomatch mask computed in the next step. An alternative is to fill them with an arbitrary value (e.g. a zeroed register) and then bitwise-AND nomatch with the load mask before computing count and doing the compress-store, but this requires one more instruction; filling the masked lanes with the character to be deleted has the same effect.
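The equivalence of the two strategies can be checked in scalar code. This sketch (hypothetical names nomatch_fill and nomatch_and, not from the original posts) computes the nomatch mask both ways for a tail chunk of next bytes and shows they agree:

```c
#include <assert.h>
#include <stdint.h>

/* Strategy (a): lanes beyond the load mask are filled with c, so they
 * fail the != c comparison on their own. */
static uint64_t
nomatch_fill(char c, int next, const char str[64])
{
	uint64_t mask = (uint64_t)-1 >> (64 - next);
	uint64_t nomatch = 0;
	for (int i = 0; i < 64; i++) {
		char lane = (mask >> i & 1) ? str[i] : c;
		if (lane != c)
			nomatch |= (uint64_t)1 << i;
	}
	return nomatch;
}

/* Strategy (b): lanes beyond the load mask are zeroed, and the result
 * is ANDed with the load mask afterwards (one extra instruction). */
static uint64_t
nomatch_and(char c, int next, const char str[64])
{
	uint64_t mask = (uint64_t)-1 >> (64 - next);
	uint64_t nomatch = 0;
	for (int i = 0; i < 64; i++) {
		char lane = (mask >> i & 1) ? str[i] : 0;
		if (lane != c)
			nomatch |= (uint64_t)1 << i;
	}
	return nomatch & mask;
}
```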

The remainder is as in the 64-byte chunk variant, and the core here is wrapped in a loop to process an entire string.

It’s worth mentioning that AVX-512 also comes with 128-bit and 256-bit variants of masked load/store and compress instructions, so we can write versions using these to compare with our previous pext or pbext variants.

Timings

This is a new machine with a Zen 5 AMD RYZEN AI MAX+ PRO 395, running Debian 13 Trixie with gcc 14.2.0. We redo all the measurements of the previous kernels for comparison.

Method             Time            Throughput
------             ----            ----------
REFERENCE          4.96 seconds    3.21995 GB/s
PEX SWAR           1.93 seconds    8.28743 GB/s
PEX MMX            1.31 seconds    12.1869 GB/s
PEX SSE            1.31 seconds    12.1861 GB/s
PBEXT8             1.72 seconds    9.2862 GB/s
PBEXT16            0.946 seconds   16.9636 GB/s
PEX SSE16          0.861 seconds   18.5166 GB/s
CHECKMATCH         1.91 seconds    8.34062 GB/s

AVX-512            0.16 seconds    95.7304 GB/s
AVX-512 MASK       0.16 seconds    95.5635 GB/s
AVX-512 MASK 128   0.64 seconds    24.6832 GB/s
AVX-512 MASK 256   0.32 seconds    49.1825 GB/s

The first thing to notice is that the previous kernels run roughly twice as fast on this machine as on the old Intel CPU baseline. Their relative performance remains about the same.

The new AVX-512 variants, however, are radically faster (at least 5x the previous best). AVX-512 is the 64-byte-chunk version presented first, without the masking (but inlined at the call site). AVX-512 MASK is the masked variant presented second, while the versions suffixed 128 and 256 are the same but specialised to 128-bit (SSE) and 256-bit (AVX) registers respectively.

The masked variant performs the same as the unmasked one: the masked instructions make it much easier to deal with chunks that don’t perfectly match the register size, so it’s great to know the performance holds up here too.

Secondly, even the narrower 128-bit and 256-bit masked variants provide a decent (in the 128-bit case) to significant (in the 256-bit case) speedup over the pre-AVX-512 kernels. (The AVX2 PEXT version I originally wrote isn’t in the timings; it was slower than the SSE version, and remains so on the Zen 5 core.) Indeed, the AVX-512 masked kernels have throughput roughly proportional to the register width.

Comparing our original portable reference version with the full-width AVX-512 kernels, we see that AVX-512 gives us roughly a 30x increase in throughput (95.7 GB/s versus 3.22 GB/s). That’s a lot.