Parallel byte extract using MMX shuffle
January 8, 2025
In a discussion on Twitter, somebody pointed out to me that the pext instruction is much slower on AMD processors than on Intel ones, at least until fairly recent processors (Zen 3) fixed this. I have since lost the thread, but I learned from it that pext can be emulated using shuffle instructions and a lookup table of shuffle masks.
This post explores and measures that approach using the character deletion example from a previous post. It's probably worth reading that post first to understand the flavour of instruction we're emulating.
Approach
For the character deletion example we don't need bit-level packing as pext does - we actually only want to pack bytes. There are plenty of SIMD instructions which operate on bytes, so we can make use of those instead.
Given a block of 8 characters and an 8-bit mask describing which characters we want to keep (set bits indicate the characters that should be kept), it's quite straightforward to compute a shuffle mask that repacks the characters to keep and discards the rest. For example, if we have the mask 00001001 (bits 0 and 3 set), the corresponding shuffle mask is 0xFFFFFFFFFFFF0300 (a shuffle-control byte with its high bit set zeros that lane).
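For reference, here is a scalar model of what the 64-bit byte shuffle (pshufb) does with such a control word - my own illustrative sketch, not code from the post:

/* Scalar model of the 64-bit pshufb: a control byte with its high bit set
   zeros that output lane; otherwise its low three bits select an input lane. */
static void
shuffle8_model(const uint8_t in[8], const uint8_t control[8], uint8_t out[8])
{
    for (int j = 0; j < 8; ++j)
    {
        out[j] = (control[j] & 0x80) ? 0 : in[control[j] & 0x07];
    }
}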
Indeed, the program below generates the mapping between mask and shuffle mask, in a form that can be pasted into a C99 array initializer to create an 8 * 256 = 2 KB lookup table.
#include <assert.h>
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    for (int i = 0; i < 0xFF + 1; ++i)
    {
        uint64_t shuffle = 0;
        int next = 0;
        /* Pack the index of each set bit into the next free shuffle lane. */
        for (uint64_t b = 0; b < 8; ++b)
        {
            if (i & (1 << b))
            {
                shuffle |= b << (next * 8);
                next++;
            }
        }
        /* Fill the remaining lanes with 0xFF so the shuffle zeros them. */
        if (next < 8)
        {
            shuffle |= ((uint64_t)-1 << (next * 8));
        }
        printf("\t[0x%.2X] = 0x%.16llX,\n", i, (unsigned long long)shuffle);
        assert(next == __builtin_popcount(i));
    }
    return 0;
}
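For illustration, here are a few of the 256 lines the program prints (these follow directly from the logic above; the remaining entries are generated the same way):

    [0x00] = 0xFFFFFFFFFFFFFFFF,
    [0x01] = 0xFFFFFFFFFFFFFF00,
    [0x02] = 0xFFFFFFFFFFFFFF01,
    [0x03] = 0xFFFFFFFFFFFF0100,
    ...
    [0x09] = 0xFFFFFFFFFFFF0300,
    ...
    [0xFF] = 0x0706050403020100,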
If such a lookup table is called shuffle, we can define a byte packing function as described above (pbext8 - parallel byte extract, 8-way).
static inline int
pbext8(uint8_t mask, const char in[8], char out[8])
{
    /* Look up the precomputed shuffle control for this mask. */
    __m64 shuffle_mask = _m_from_int64(shuffle[mask]);
    __m64 bytes;
    memcpy(&bytes, in, sizeof(bytes));
    /* Pack the kept bytes to the front; the remaining lanes are zeroed. */
    __m64 extract = _mm_shuffle_pi8(bytes, shuffle_mask);
    memcpy(out, &extract, sizeof(extract));
    /* Number of bytes kept. */
    return len[mask];
}
len[mask] here is the same as __builtin_popcount(mask) - it just turns out that defining this, too, as a lookup table executes a bit faster.
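The len table itself isn't shown here; a minimal sketch of one way it could be built (the name init_len and the run-time initialisation are my assumptions, not from the post):

static uint8_t len[256];

/* Fill len[m] with the population count of m once at start-up. */
static void
init_len(void)
{
    for (int i = 0; i < 256; ++i)
    {
        len[i] = (uint8_t)__builtin_popcount(i);
    }
}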
This allows us to write a character deletion procedure by:

- Splatting the character across the 8 lanes of an MMX register.
- Comparing the character string against the splat to identify which characters need to be deleted (matching lanes are set to 0xFF).
- Creating a mask of the characters to delete using movemask.
- Using pbext8 above with the inverted mask (the inverted mask selects the characters we want to keep).
In code:
static inline int
delchar_pbext8(char c, const char str[8], char out[8])
{
    uint64_t chars;
    memcpy(&chars, str, sizeof(chars));
    /* Splat c across all 8 byte lanes (the cast avoids sign extension of char). */
    uint64_t splat = 0x0101010101010101ull * (uint8_t)c;
    /* Lanes equal to c become 0xFF. */
    __m64 match = _mm_cmpeq_pi8(_m_from_int64(chars), _m_from_int64(splat));
    int mask = _mm_movemask_pi8(match);
    /* Invert the mask: keep the characters that did not match. */
    return pbext8(~mask & 0xFF, str, out);
}
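For completeness, a sketch of how the 8-byte kernel might be applied across a larger buffer (this is my assumption about the shape of the harness, not the exact timing loop from the post; it assumes n is a multiple of 8 and out has room for n bytes):

static size_t
delchar_buffer(char c, const char *in, size_t n, char *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i += 8)
    {
        /* Each call stores 8 bytes, but only the first return-value bytes are
           valid; the following iteration overwrites the padding. */
        written += (size_t)delchar_pbext8(c, in + i, out + written);
    }
    return written;
}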
Measurement and analysis
Running the same test as in the prior article (timing the character-deletion loop over a large buffer of random characters), we can compare approaches. The new approach, PBEXT8, is at the bottom.
Method | Time | Throughput |
---|---|---|
REFERENCE | 8.87 seconds | 1.80383 GB/s |
PEX SWAR | 6.56 seconds | 2.43902 GB/s |
PEX MMX | 2.68 seconds | 5.97015 GB/s |
PEX SSE | 2.34 seconds | 6.83761 GB/s |
PBEXT8 | 3.21 seconds | 4.98442 GB/s |
We can see this performs pretty well - a fair speedup over the portable and SWAR versions (the SWAR version runs a bit faster than the portable one in this measurement, although that is not consistent). It's still slower than the MMX- and SSE-based pex variants (on my Intel machine), but not far off.
Disassembling the timing binary, we can see the instructions mirror the procedure quite naturally.
A snippet of the core of the procedure is below (it’s inlined and a few iterations are unrolled in the binary I compiled).
Particularly nice is that the table lookup of the shuffle mask can be expressed as a memory operand to pshufb. The two xchg instructions bracketing the shuffle are a bit odd - the memory operand presumably could have been written as (%rdx,%r9,8) instead, but never mind.
movq     0x8(%rdi,%r12,1),%mm1   # load the next 8 input characters
movq     %mm1,%mm2
pcmpeqb  %mm0,%mm2               # %mm0 holds the splatted character
pmovmskb %mm2,%r9d               # mask of matching (to-delete) characters
xor      $0xff,%r9               # invert: set bits now mark characters to keep
xchg     %rdx,%r9
pshufb   (%r9,%rdx,8),%mm1       # shuffle-mask table lookup as a memory operand
xchg     %rdx,%r9
movq     %mm1,(%r8)              # store the packed bytes
The movemask instruction comes from the SSE instruction set, so it is available on all x86_64 processors, while the shuffle requires SSSE3, which by now is also available on most x86_64 processors.
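As an aside, one way to gate the SSSE3 path at compile time - a sketch assuming a GCC/Clang-style build, where -mssse3 or -march=native defines __SSSE3__; the HAVE_PSHUFB name is mine:

#if defined(__SSSE3__)
#define HAVE_PSHUFB 1
#else
#define HAVE_PSHUFB 0
#endif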
Like the SSE variant from the original post, it's likely this can be done 16 bytes at a time (using the same lookup table and merging the masks).
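To sketch what that might look like - speculative on my part, not something measured in this post: one 16-byte compare and movemask, then the same 8-wide shuffle table applied to each half of the mask, with the second half's output written directly after the first half's kept bytes.

#include <emmintrin.h> /* SSE2 */

/* Speculative 16-byte variant (an assumption, not from the post): one wide
   compare + movemask, then two 8-wide packs reusing the shuffle table. */
static inline int
delchar_pbext16(char c, const char str[16], char out[16])
{
    __m128i chars = _mm_loadu_si128((const __m128i *)str);
    __m128i match = _mm_cmpeq_epi8(chars, _mm_set1_epi8(c));
    int keep = ~_mm_movemask_epi8(match) & 0xFFFF;
    /* Pack the low 8 bytes, then the high 8 bytes straight after them. */
    int n = pbext8((uint8_t)keep, str, out);
    return n + pbext8((uint8_t)(keep >> 8), str + 8, out + n);
}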