Deleting characters in a string - extreme edition

May 30, 2024

Intel’s vector instruction sets (SSE, AVX etc.) are well known, but Intel has added various other instruction set extensions specialised for other tasks. One such task is operating on integer register bitsets.

One toy problem I came across was removing all instances of a character from a string (basically, tr -d), and I got curious about how quickly this might be done. In particular, I figured it would be neat to see if this could be done 8-characters at a time. In this vein, I came across the pext instruction ¹ for packing bytes according to some selection mask.

pext operates as follows. Given a mask m, and a value v it extracts exactly the bits set to 1 in m from v, and packs them in the result, from the lowest bit to the highest bit. For example, if m is 0xff00ff00 and v is 0xaabbccdd, then pext produces 0x0000aacc.

The idea was to use this with a comparison mask to pack only the bytes which don’t match a character in a string².

A reference portable example

Out initial reference example looks like this:

static int
delchar_portable(char c, char str[8], char out[8])
{
	char *po = out;

	for (int ix = 0; ix < 8; ++ix)
	{
		if (str[ix] != c) *po++ = str[ix];
	}

	return po - out;
}

delchar_portable takes a string and a character c to remove from the string, and writes the result out to out, returning the number of characters written out. We’ll use this as a baseline to see if we can do this more quickly.

SWAR variant

My first attempt tried to avoid using special SIMD instructions at all, and doing SIMD within a register (SWAR). I came up with the following

static int
delchar_swar(char c, char str[8], char out[8])
{
	uint64_t chars;
	memcpy(&chars, str, sizeof(chars));
	uint64_t c_splat = 0x0101010101010101 * c;
	uint64_t match_zero = c_splat ^ chars;
	uint64_t reduced = ((match_zero & 0xF0F0F0F0F0F0F0F0) >> 4) | (match_zero & 0x0F0F0F0F0F0F0F0F);
	reduced = ((reduced & 0x0C0C0C0C0C0C0C0C) >> 2) | (reduced & 0x0303030303030303);
	reduced = ((reduced & 0x0202020202020202) >> 1) | (reduced & 0x0101010101010101);
	int remaining = _mm_popcnt_u64(reduced);
	uint64_t mask = reduced * 0xFF;
	uint64_t extracted = _pext_u64(chars, mask);
	memcpy(out, &extracted, sizeof(extracted));

	return remaining;
}

This works, but turns out to be about 50% slower than the portable version. The main work here is computing the mask, which is dominated by producing reduced. While reduced is computing 8 different masks with some parallelism, the steps in its computation necessarily depend on each other, so the relevant instructions can’t be executed in parallel. Note the use of popcnt, which is a more well known bit manipulation instruction.

The next idea was to use SIMD instructions to do the comparisons and get a mask more cheaply.

MMX version

For reasons which are unclear in hindsight³, I went full 90s and reached for MMX instructions (a precursor to SSE, which first appeared on some Pentium processors).

MMX registers and instructions work on 64-bit values in a manner quite similar to SSE. I came up with the following

static int
delchar_mmx(char c, char str[8], char out[8])
{
	uint64_t chars;
	memcpy(&chars, str, sizeof(chars));
	uint64_t c_splat = 0x0101010101010101 * c;
	uint64_t match_zero = c_splat ^ chars;
	uint64_t mask = ~_mm_cvtm64_si64(_m_pcmpeqb(_m_from_int64(match_zero) , _m_from_int(0)));
	uint64_t extracted = _pext_u64(chars, mask);
	memcpy(out, &extracted, sizeof(extracted));

	return _mm_popcnt_u64(mask) >> 3;
}

SSE variant

The MMX variant converts quite easily to an SSE version. These SSE instructions are all available in the SSE2 instruction set extension, available on any x86_64 processor.

static int
delchar_sse(char c, char str[8], char out[8])
{
	uint64_t chars;
	memcpy(&chars, str, sizeof(chars));
	uint64_t c_splat = 0x0101010101010101 * c;
	uint64_t match_zero = c_splat ^ chars;
	uint64_t mask = ~_mm_cvtsi128_si64(_mm_cmpeq_epi8(_mm_cvtsi64_si128(match_zero), _mm_set1_epi8(0)));
	uint64_t extracted = _pext_u64(chars, mask);
	memcpy(out, &extracted, sizeof(extracted));

	return _mm_popcnt_u64(mask) >> 3;
}

Measurement

To give an idea of how these compare for throughput, I generated 64kb of random bytes, and time running each variant (245 * 1024) times, with a random character which is picked at start of day for all runs.

The following is a typical run

Method	Time	Throughput
`PORTABLE`	9.69 seconds	1.65119 GB/s
`SWAR`	14.72 seconds	1.08696 GB/s
`MMX`	3.01 seconds	5.31561 GB/s
`SSE`	2.5 seconds	6.4 GB/s

Typically, the MMX/SSE variant has a 3-5x throughput improvement on the portable version. SSE usually runs a little quicker than the MMX variant, but there isn’t a lot in it. The SSE version can probably be extended to work on 16-characters at a time, using doing two pext and two (dependent) writes to out, from which it might gain some advantage.

Conclusion

This is a reasonable speedup. While this is a toy problem, the purpose of the exploration was to find another technique for bulk processing string data.

the pdep instruction is also interesting for spreading bits out in a register. ↩︎
masks are easy to combine, by and-ing them together, so once we can remove one character, removing many is a simple extension. ↩︎
probably a mixture of the 64-bit register size, and a fond memory of my dad talking very excitedly about 486s, Pentiums, and Pentium MMX machines back when he was assembling 286s and 386s from discarded parts. I don’t know if he knew what MMX really was, but he had some idea it made things fast. ↩︎