Encoding binary as hex using SIMD instructions

April 26, 2024

On and off over the last year or two I’ve been trying to learn more about doing sophisticated things with SIMD instructions. There is quite a gap between the easy examples (doing arithmetic on big batches of data in a purely lanewise fashion) and the sophisticated ones (say, validating UTF-8 encoded bytes), and not a lot of readable material to bridge it.

I’ve been on the lookout for examples between the very easy and the more sophisticated, for my own practice and learning, but also to provide a more tractable write-up introducing some of the instructions which can be used for more complicated tasks.

The following is in that vein - taking raw bytes and formatting them as hexadecimal text. The problem is essentially still lanewise, but it’s an opportunity to explore some of the more interesting instructions, which let us do better than the compiler (clang 13.0.0 in my case), even when it spots the opportunity to vectorise (SIMD-ify).

SIMD? SSE?

SIMD (single instruction multiple data) lets a single instruction operate on batches of data, typically packed into a single register.

x86_64 has various SIMD instructions, but the most commonly available are the so called Streaming SIMD Extensions (SSE), which we will use here. SSE added 16-byte registers to x86, and a variety of operations on these registers. For example, the addps instruction takes two 16-byte registers, and treats them as an array of 4 packed 32-bit floats, adding the floats at the same index in each array (“lanewise”). C compilers expose these instructions for use via intrinsics.
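For example, here is a minimal sketch of a lanewise add using intrinsics (the _ps intrinsics and the __m128 type come from xmmintrin.h):

#include <xmmintrin.h>

float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
float r[4];

__m128 va = _mm_loadu_ps(a);    /* load 4 floats into a 16-byte register */
__m128 vb = _mm_loadu_ps(b);
__m128 vr = _mm_add_ps(va, vb); /* addps: vr[i] = a[i] + b[i] in each lane */
_mm_storeu_ps(r, vr);           /* r is now { 11, 22, 33, 44 } */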

Some instructions/intrinsics

Let’s start with an overview of the instructions and intrinsics we’ll be using. This lets us introduce a few of the things we can do with SIMD intrinsics, and show their use in small (and hopefully clearer) use cases.

__m128i

Not an instruction at all, this is just the type used in compatible C compilers to represent a 128-bit (16-byte) packed integer value.

I’ll refer to these as 16-byte registers in what follows, but the compiler might not actually use a register to store these values (for example, they might be stored in memory, and then loaded when needed).

_mm_set_epi8

_mm_set_epi8 takes a sequence of 16 bytes and packs them into a 16-byte value. Note that the ordering is not what you might expect intuitively (but it is consistent with x86_64 being little endian: the last argument becomes byte 0 of the corresponding 128-bit integer, and so occurs first).

So for example:

_mm_set_epi8('f', 'e', 'd', 'c', 'b', 'a', '9', '8', '7', '6', '5', '4', '3', '2', '1', '0');

produces a packed 16-byte value which looks like

| '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' |

We’ll make use of this example as a lookup table for 4-bit nibbles - the ith lane corresponds to the hexadecimal digit for i.

_mm_set_epi8 doesn’t correspond directly to an instruction - the compiler synthesizes something with the same effect. Looking at the assembly output from an example program, the value can, for example, be stored in the data segment of the binary and loaded into a register when needed.

_mm_set1_epi8

_mm_set1_epi8 takes a single byte argument x, and installs that byte in all 16 lanes:

| x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |

As with _mm_set_epi8, the compiler synthesizes this.

_mm_load_si128, _mm_store_si128

_mm_load_si128 loads 16 bytes from a 16-byte aligned address into a 16-byte register, and _mm_store_si128 stores a 16-byte register to a 16-byte aligned address. Both correspond to the instruction movdqa, which in the typical x86_64 manner is encoded slightly differently depending on whether the memory operand is the destination or the source.
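For example, copying 16 bytes between two buffers (a sketch; src and dst are assumed to be suitably aligned pointers):

__m128i chunk = _mm_load_si128((const __m128i *)src);  /* movdqa, memory to register */
_mm_store_si128((__m128i *)dst, chunk);                /* movdqa, register to memory */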

_mm_and_si128

This corresponds to an assembly instruction pand which does a bitwise-and between two 16-byte registers.

So, to extract the lower nibble of each of the 16 bytes in an __m128i named chunk, we can do

__m128i mask	= _mm_set1_epi8(0x0F);
__m128i nibbles	= _mm_and_si128(chunk, mask);

_mm_srl_epi32

This treats a 16-byte wide register as four 32-bit integers (there isn’t a corresponding instruction for bytes). It takes a shift count as its second argument, a 128-bit value of which the low 64 bits are used: each 32-bit integer in the first argument is shifted right by that many bits (a count larger than 31 just produces zero).

For example, shifting a 16-byte register holding the 32-bit integers i, j, k, l right by 4 bits gives something like:

| i >> 4 | j >> 4 | k >> 4 | l >> 4 |

This corresponds to the instruction psrld.

_mm_cvtsi32_si128

Copies a 32-bit integer into the lower bits of a 16-byte value. This corresponds to the instruction movd.

We can use this to extract the high nibbles of a 16-byte chunk as follows:

__m128i mask	= _mm_set1_epi8(0x0F);
__m128i shift	= _mm_cvtsi32_si128(4);
__m128i nibbles	= _mm_and_si128(_mm_srl_epi32(chunk, shift), mask);

_mm_shuffle_epi8

This is the first instruction which does something a bit non-obvious. The corresponding instruction is pshufb.

_mm_shuffle_epi8(t, l) essentially does a parallel set of table lookups, using the lower nibble of each byte of l as an index into t, treated as a table of 16 bytes (a nibble - 4 bits - can index from 0 to 15). The upper nibble of each byte of l is ignored, unless its high bit is 1, in which case _mm_shuffle_epi8 masks that lookup and puts 0 in the output.

To get an idea of how this works, if

t = | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | xa | xb | xc | xd | xe | xf |
l = | 1  | 0  | 3  | 2  | 5  | 4  | 7  | 6  | 9  | 8  | b  | a  | d  | c  | f  | e  |

then _mm_shuffle_epi8 would swap bytes pairwise:

_mm_shuffle_epi8(t, l) = | x1 | x0 | x3 | x2 | x5 | x4 | x7 | x6 | x9 | x8 | xb | xa | xd | xc | xf | xe |
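As code, the pairwise swap above might look like the following (a sketch; remember that _mm_set_epi8 lists lanes from index 15 down to 0, so the index vector reads back to front):

__m128i idx     = _mm_set_epi8(14, 15, 12, 13, 10, 11, 8, 9,
                                6,  7,  4,  5,  2,  3, 0, 1);
__m128i swapped = _mm_shuffle_epi8(t, idx);  /* pshufb: the bytes of t, swapped pairwise */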

To get an idea of how the masking works, suppose we wanted to keep the first byte of each pair, and zero the second:

have = | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | xa | xb | xc | xd | xe | xf |
want = | x0 | 0  | x2 | 0  | x4 | 0  | x6 | 0  | x8 | 0  | xa | 0  | xc | 0  | xe | 0  |

We can get want with _mm_shuffle_epi8(have, l), where:

l = | 0 | ff | 2 | ff | 4 | ff | 6 | ff | 8 | ff | a | ff | c | ff | e | ff |

(the same works replacing each 0xff with 0x80 - any index with its high bit set will do - but ff makes the hexadecimal base clear).

We’ll use this to convert each nibble of a byte to its corresponding hexadecimal ASCII character.

Shuffling is quite powerful, and a lot of clever algorithms have been produced using it.

_mm_unpacklo_epi8

This corresponds to the instruction punpcklbw. It takes two 16-byte registers, takes the lower 8 bytes of each, and interleaves these bytes into a 16-byte register.

For example, if

x = | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | xa | xb | xc | xd | xe | xf |
y = | y0 | y1 | y2 | y3 | y4 | y5 | y6 | y7 | y8 | y9 | ya | yb | yc | yd | ye | yf |

Then _mm_unpacklo_epi8(x, y) looks like

| x0 | y0 | x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 | x5 | y5 | x6 | y6 | x7 | y7 |

_mm_unpackhi_epi8

This corresponds to the instruction punpckhbw, and does the same as _mm_unpacklo_epi8, but with the upper 8 bytes of its arguments. Continuing the example in _mm_unpacklo_epi8, _mm_unpackhi_epi8(x, y) looks like

| x8 | y8 | x9 | y9 | xa | ya | xb | yb | xc | yc | xd | yd | xe | ye | xf | yf |
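Together, the two unpack intrinsics recover the full interleaving of x and y, eight bytes at a time:

__m128i lo = _mm_unpacklo_epi8(x, y);  /* x0, y0, x1, y1, ..., x7, y7 */
__m128i hi = _mm_unpackhi_epi8(x, y);  /* x8, y8, x9, y9, ..., xf, yf */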

How are we going to do this?

Given all of this, we’re going to convert 16-byte chunks into a hexadecimal string by following these steps:

  1. Load a 16-byte chunk into a register.
  2. Mask the low nibbles.
  3. Shift and mask the high nibbles.
  4. Map the low nibbles onto their hexadecimal digits (_mm_shuffle_epi8).
  5. Map the upper nibbles onto their hexadecimal digits (_mm_shuffle_epi8).
  6. Recover the hex encoding of the first 8 bytes by interleaving the upper- and lower-nibble digits (_mm_unpacklo_epi8).
  7. Recover the hex encoding of the last 8 bytes by interleaving the upper- and lower-nibble digits (_mm_unpackhi_epi8).
  8. Write these hex encodings out into a 32-byte buffer.

Note we’ll arrange that we’re always dealing with 16-byte aligned chunks of sufficient size. We can easily arrange our buffers to be aligned and large enough, padding the input and truncating the output afterwards if need be. We could also use unaligned loads and stores - there doesn’t appear to be any difference in cost on modern chips.

Putting it together

Now we just need to write it out.

#include <stdint.h>
#include <tmmintrin.h>

static void
simdhex(const void *src, uint8_t out[32])
{
	/* Lookup table: lane i holds the hex digit for nibble value i. */
	__m128i enc	= _mm_set_epi8('f', 'e', 'd', 'c',
	                               'b', 'a', '9', '8',
	                               '7', '6', '5', '4',
	                               '3', '2', '1', '0');

	__m128i mask	= _mm_set1_epi8(0x0F);
	__m128i shift	= _mm_cvtsi32_si128(4);

	/* src must be 16-byte aligned. */
	__m128i chunk	= _mm_load_si128((const __m128i *)src);

	/* Extract the low and high nibble of each byte. */
	__m128i lower	= _mm_and_si128(chunk, mask);
	__m128i shiftr	= _mm_srl_epi32(chunk, shift);
	__m128i upper	= _mm_and_si128(shiftr, mask);

	/* Map each nibble to its ASCII hex digit. */
	__m128i hexl	= _mm_shuffle_epi8(enc, lower);
	__m128i hexu	= _mm_shuffle_epi8(enc, upper);

	/* Interleave: high digit then low digit for each input byte. */
	__m128i enc0	= _mm_unpacklo_epi8(hexu, hexl);
	__m128i enc1	= _mm_unpackhi_epi8(hexu, hexl);

	/* out must also be 16-byte aligned. */
	_mm_store_si128((__m128i *)out, enc0);
	_mm_store_si128((__m128i *)(out + 16), enc1);
}

Note that to compile this you’ll need to tell the C compiler it may use SSSE3 instructions (needed for _mm_shuffle_epi8). With gcc/clang, passing -mssse3 should do. Most modern x86_64 chips support this.
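For completeness, here is a small (hypothetical) driver showing how simdhex might be called, placed in the same file as the function above; the alignment and output size follow the notes earlier (using C11’s _Alignas):

#include <stdio.h>

int main(void)
{
	_Alignas(16) uint8_t in[16];
	_Alignas(16) uint8_t out[33];

	for (int i = 0; i < 16; i++)
		in[i] = (uint8_t)(i * 0x11);	/* 0x00, 0x11, 0x22, ..., 0xff */

	simdhex(in, out);
	out[32] = '\0';
	printf("%s\n", out);	/* prints 00112233445566778899aabbccddeeff */
	return 0;
}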

Is this any faster than what the compiler can do?

I decided to do a couple of quick measurements encoding randomly generated bytes. I compared with two implementations:

  1. LOOKUP - this uses a 16-character lookup table and maps the nibbles one-by-one.
  2. DIRECT - this uses a branch-free expression to map nibbles to characters.
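For reference, a scalar branch-free mapping along the lines of DIRECT might look like the following (a sketch; not necessarily the exact expression used in the benchmark):

static char
hexdigit(unsigned n)
{
	/* n is a nibble (0-15): pick between '0' + n and 'a' + (n - 10)
	 * without branching on the value. */
	return (char)('0' + n + (n > 9) * ('a' - '0' - 10));
}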

In the second case (DIRECT), the compiler vectorises the loop, and comes out pretty fast. It still takes roughly twice as long on the same input compared with our version above. The compiler spots that it can build a register of upper nibbles encoded as hex and a register of lower nibbles encoded as hex, and then interleave them. What it doesn’t seem to do so well is the translation from nibbles to hex characters; instead it computes both '0' + n and 'a' + (n - 10) in parallel for each nibble, and then uses a comparison mask to select the right one. This is roughly the branch-free expression used to compute the character, so it’s not crazy that the compiler decides to do it this way.

The lookup table version (LOOKUP) doesn’t get vectorised at all; the assembly appears to just read bytes straight out of memory each time. Even though the table probably remains in cache, this still ends up an order of magnitude slower than our SIMD version.

As a third datapoint, I replaced the hex translation with two memcpys which just copy the 16-byte chunk back out to memory twice (filling 32 bytes, as with the hex encoded version). This gives something approximating a laws-of-physics upper bound, where we do next to no processing.

The following table summarises the results.

Method   Time            Throughput
SIMD     1.68 seconds    9.52381 GB/s
LOOKUP   13.42 seconds   1.19225 GB/s
DIRECT   2.57 seconds    6.22568 GB/s
MEMCPY   1.09 seconds    14.6789 GB/s

It might be possible to do better still, but this is not bad at all!

Updates

I realised after the fact that there is a variant of the shift intrinsic which takes an immediate count directly, rather than a count loaded into a register. _mm_srli_epi32 lets us compute shiftr as follows:

	__m128i shiftr = _mm_srli_epi32(chunk, 4);

removing the need for declaring a shift value (and making the whole thing a bit more consistent).

This still corresponds to psrld, which can encode an immediate as an operand, so at the instruction level there isn’t a large difference for our purposes.
