Parallel byte delete 16-wide
January 9, 2025
Following on from previous investigations into deleting characters, and deleting characters using MMX instructions, it remains to be seen what can be done using the full 16-byte width of SSE registers.
There are two approaches to consider: first, building a mask using SSE instructions and using two pext instructions to piece the 64-bit upper and lower halves back together (an extension of the SSE variant considered in the first post); and second, an extension of the shuffling approach from the previous post, where we shuffle two 8-byte chunks in a single SIMD register and then write them out separately.
We also trial a portable approach that optimizes for the case where the character to be deleted is relatively rare. This adapts our reference portable implementation by adding a fast-path check for whether the chunk contains the character to be deleted at all.
Shuffle extension
We can extend the shuffling example from last time:
#include <immintrin.h> /* shuffle and pbext_len are the lookup tables from the last post */

static inline int
pbext16_si128(uint16_t mask, __m128i data, char out[16])
{
    int len_low = pbext_len[mask & 0xFF];
    int len_high = pbext_len[mask >> 8];
    __m128i shuffle_mask_low = _mm_cvtsi64_si128(shuffle[mask & 0xFF]);
    /* offset the upper chunk's shuffle indices by 8 (see below) */
    uint64_t upper_shuffle = shuffle[mask >> 8] | 0x0808080808080808;
    __m128i shuffle_mask_high = _mm_slli_si128(_mm_cvtsi64_si128(upper_shuffle), 8);
    __m128i shuffle_mask = _mm_or_si128(shuffle_mask_low, shuffle_mask_high);
    __m128i chunks = _mm_shuffle_epi8(data, shuffle_mask);
    _mm_storeu_si64(out, chunks);
    _mm_storeu_si64(out + len_low, _mm_srli_si128(chunks, 8));
    return len_low + len_high;
}
static inline int
delchar_pbext16(char c, const char in[16], char out[16])
{
    __m128i data = _mm_lddqu_si128((const __m128i *)in);
    __m128i splat = _mm_set1_epi8(c);
    __m128i match = _mm_cmpeq_epi8(data, splat);
    int mask = _mm_movemask_epi8(match);
    /* invert: set bits mark the bytes to keep */
    return pbext16_si128(~mask & 0xFFFF, data, out);
}
We’re reusing the lookup tables (shuffle and pbext_len) from the last post.
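For reference, here's a sketch of how those tables could be constructed (the real construction lives in the previous post; the names and the 0xFF padding here follow the code above): shuffle[m] packs the indices of the set bits of m into the low bytes, padding unused lanes with 0xFF so pshufb zeroes them, and pbext_len[m] is just the popcount of m.

static uint64_t shuffle[256];
static int pbext_len[256];

static void
build_tables(void)
{
    /* one shuffle entry and one length per 8-bit keep-mask */
    for (int m = 0; m < 256; ++m)
    {
        uint64_t entry = ~0ULL; /* every lane starts as 0xFF */
        int n = 0;
        for (int bit = 0; bit < 8; ++bit)
        {
            if (m & (1 << bit))
            {
                /* write byte index `bit` into output lane `n` */
                entry &= ~(0xFFULL << (8 * n));
                entry |= (uint64_t)bit << (8 * n);
                ++n;
            }
        }
        shuffle[m] = entry;
        pbext_len[m] = n;
    }
}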
The real work here is done using the 16-byte byte extract function.
This operates in two chunks. The lower chunk is handled much as before, using the lookup table to shuffle bytes within the lower half of an SSE register.
The upper chunk works on the same principle, except the lookup table entries need offsetting by 8 (past the lower chunk), so we do a lane-wise bitwise OR with 0x08 on each byte (upper_shuffle).
This is like adding 0x08 to every lane, except that some lanes are set to 0xFF, and since we don't want to cause overflows (which would interfere with neighbouring lanes) we use a bitwise OR instead of an add. This works here because all of the index bytes in the shuffle masks have the 0x08 bit unset (i.e. are 0x07 or less), and the 0xFF padding lanes are unaffected by the OR.
These two shuffle masks are combined to perform a single shuffle organizing two 8-byte chunks.
We then write the chunks out sequentially and return the sum of the lengths of the preserved chunks.
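As a quick usage sketch (my own example, assuming the tables have been initialized), deleting ',' from a 16-byte chunk:

char out[16];
int n = delchar_pbext16(',', "a,b,c,d,e,f,g,h!", out);
/* n == 9, and the first 9 bytes of out are "abcdefgh!" */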
Bit extract extension
/* _pext_u64 requires BMI2 */
static int
delchar_sse16(char c, const char str[16], char out[16])
{
    __m128i data = _mm_lddqu_si128((const __m128i *)str);
    __m128i splat = _mm_set1_epi8(c);
    __m128i match = _mm_cmpeq_epi8(data, splat);
    /* keep lanes are 0xFF where the byte should be preserved */
    __m128i keep = _mm_cmpeq_epi8(match, _mm_set1_epi8(0));
    uint64_t keep_low = _mm_cvtsi128_si64(keep);
    uint64_t keep_high = _mm_cvtsi128_si64(_mm_srli_si128(keep, 8));
    uint64_t data_low = _mm_cvtsi128_si64(data);
    uint64_t data_high = _mm_cvtsi128_si64(_mm_srli_si128(data, 8));
    uint64_t extract_low = _pext_u64(data_low, keep_low);
    uint64_t extract_high = _pext_u64(data_high, keep_high);
    /* popcount of the byte-granular masks, divided by 8, gives byte counts */
    int len_low = _mm_popcnt_u64(keep_low) >> 3;
    int len_high = _mm_popcnt_u64(keep_high) >> 3;
    memcpy(out, &extract_low, sizeof(extract_low));
    memcpy(out + len_low, &extract_high, sizeof(extract_high));
    return len_low + len_high;
}
This is a little simpler, once pext has been understood.
We figure out which bytes we want to keep in a mask (keep), and then work toward two pext instructions on the upper and lower halves of the SSE register (by pulling them out as uint64_ts).
We then write out the chunks in a similar fashion to the last example.
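To make the pext step concrete, here's a small worked example of my own (not from the posts): with a byte-granular mask of 0x00/0xFF lanes, pext packs the kept bytes contiguously at the low end of the result.

/* data = "abcdefgh" read little-endian -> 0x6867666564636261
 * keep 'b', 'd', 'f', 'h'              -> mask 0xFF00FF00FF00FF00
 * result packs those bytes at the bottom: 0x0000000068666462 ("bdfh") */
uint64_t packed = _pext_u64(0x6867666564636261ULL, 0xFF00FF00FF00FF00ULL);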
For whatever reason, I used a memcpy dependent on len_low and len_high to begin with. This had awful performance (more than 2x worse than my reference implementation!); switching to a statically sized variant helped enormously.
It's fairly clear that this is exactly what you'd expect, but worth pointing out.
The static memcpy here is optimized to a single (unaligned) write (looking at a Linux build today, using gcc 12.2.0).
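For contrast, the slow first attempt presumably looked something like this (a reconstruction, not the original code); the runtime-dependent length forces the compiler to emit a genuine variable-length copy:

memcpy(out, &extract_low, len_low);   /* length only known at runtime */
memcpy(out + len_low, &extract_high, len_high);

The statically sized copies always write 8 bytes, which is fine because the next chunk's store overwrites the excess and the true length is returned to the caller.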
Portable with fastpath
The final variant is an optimization built straight on top of the original portable C variant.
This isn't 16 bytes wide, and uses just regular 8-byte integer registers.
This makes use of a trick I learned from stb_sprintf (although the same trick seems to be used in e.g. musl libc) to check a block of characters for a zero byte. Before doing any string repacking, we first check to see if any byte matches the character we want to delete. If not, we copy the entire 8-byte chunk. If there is a match, we fall back to the portable variant.
static int
delchar_portable(char c, const char str[8], char out[8])
{
    char *po = out;
    for (int ix = 0; ix < 8; ++ix)
    {
        if (str[ix] != c) *po++ = str[ix];
    }
    return po - out;
}
static int
delchar_checkmatch(char c, const char str[8], char out[8])
{
    uint64_t chars;
    memcpy(&chars, str, sizeof(chars));
    /* broadcast c to every byte; the cast avoids sign-extension
     * for characters >= 0x80 when char is signed */
    uint64_t splat = 0x0101010101010101ULL * (unsigned char)c;
    uint64_t match = chars ^ splat;
    /* If there is a character to delete, use the portable variant,
     * otherwise just copy the entire 8-byte chunk */
    if ((match - 0x0101010101010101ULL) & (~match & 0x8080808080808080ULL))
    {
        return delchar_portable(c, str, out);
    }
    memcpy(out, str, 8);
    return 8;
}
The check for a zero byte works by observing that if 0x01 is subtracted lanewise from match (in the above), then the upper bit of the rightmost zero byte must be flipped; and conversely, if the upper bit of a lane is flipped (while that lane's own upper bit was clear), then the rightmost such lane must have been zero.
Note that this check is existential: it tells you there is at least one zero byte, but borrow propagation means it doesn't preserve the location for all lanes.
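As a standalone illustration (my own sketch, not from the post), the check can be wrapped up and exercised like this:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Nonzero iff some byte of v is zero. */
static uint64_t
haszero(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & (~v & 0x8080808080808080ULL);
}

int
main(void)
{
    uint64_t a, b;
    memcpy(&a, "abcabcab", 8); /* contains 'c' */
    memcpy(&b, "abababab", 8); /* does not */
    uint64_t splat = 0x0101010101010101ULL * (unsigned char)'c';
    printf("%d %d\n", haszero(a ^ splat) != 0, haszero(b ^ splat) != 0);
    /* prints "1 0" */
    return 0;
}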
Measurement
All of these procedures show quite a lot of run-to-run variance when measured.
In my first measurements, the shuffle variant outperformed the previous MMX pext variant, and was within 1 GB/s of the SSE pext variant from the first post.
In further measurements, the pext8 variant seemed to catch up a lot, while pext16 slowed a little.
OpenBSD defaults to a robust and secure compilation, at the cost of performance, so I'll also compare with a Linux build (on a newer, but slower, processor).
The 16-wide SSE pext variant above, after the initial memcpy blooper, came out on top, peaking at around 8 GB/s throughput.
It was nice to see that on random buffer data the fast-pathed portable version performed pretty well, which you would expect because the common case takes the fast path. If the character to delete occurs densely in the input, it ought to degrade toward (slightly worse than) the plain portable variant. Still, given it uses no special instructions, this is likely a generally useful optimization that works well in most cases.
All of these measurements are on my laptops, so they shouldn't be taken as precise, but as rough comparisons.
OpenBSD clang 16.0.6 (Intel Core i7-6600U @ 2.60GHz)
Method | Time | Throughput
---|---|---
REFERENCE | 8.54 seconds | 1.87354 GB/s
PEX SWAR | 6.57 seconds | 2.43531 GB/s
PEX MMX | 2.61 seconds | 6.13027 GB/s
PEX SSE | 2.27 seconds | 7.04846 GB/s
PBEXT8 | 3.86 seconds | 4.14508 GB/s
PBEXT16 | 3.58 seconds | 4.46927 GB/s
PEX SSE16 | 2.07 seconds | 7.72947 GB/s
CHECKMATCH | 4.41 seconds | 3.62812 GB/s
Debian Linux gcc 12.2.0 (Intel Core i5-5200U @ 2.20GHz)
Method | Time | Throughput
---|---|---
REFERENCE | 15.0348 seconds | 1.0642 GB/s
PEX SWAR | 5.12729 seconds | 3.12056 GB/s
PEX MMX | 3.15063 seconds | 5.07834 GB/s
PEX SSE | 3.15402 seconds | 5.0729 GB/s
PBEXT8 | 3.78366 seconds | 4.22871 GB/s
PBEXT16 | 3.28742 seconds | 4.86703 GB/s
PEX SSE16 | 2.43846 seconds | 6.56151 GB/s
CHECKMATCH | 4.32702 seconds | 3.69769 GB/s
Conclusions
All of this is to say there is quite a lot you can do to speed up deleting characters from a string (and quite a lot to learn from messing around with toy problems).
pext-based approaches ended up fastest here, and there is probably more that can be done to eke out a bit more.
I did try a naive conversion of the PEX SSE16 approach to AVX2, aiming to process 32 characters at a time, but this ended up running slower than the shuffle variants.
I've not done any analysis into why, or anything to try and improve this; it would probably be an illuminating investigation.
An immediate suspicion is the increased number of dependencies, and the fact that pext seems to have only a single execution port (so increasing the number of pexts in a batch can effectively become a bottleneck), but I've not even taken a look at the assembly output for this, so it could be something daft.
I particularly like the pbext8-based approach (using a table lookup and a shuffle).
It provides a decent improvement while remaining relatively simple, and should presumably work reasonably well on older Zen architectures (although I haven't measured).
The fast-pathed “checkmatch” variant is also nice, although dependent on the input data. For a lot of problems this is a significant gain while remaining portable (assuming everything is two's complement, which it practically is as of 2025).