Right inverting movemask

May 20, 2024

Introduction

movemask instructions take a vector register of values, and construct an integer where the ith bit is set precisely when the higher order byte of the ith lane is set.

For example, the pmovmskb instruction (_mm_movemask_epi8 intrinsic) from the SSE2 instruction set takes a register of 16 bytes, and constructs an integer where the ith bit is set when the high-order bit of the ith byte is set.

For example:

|0xFF|0xFF|0x00|0x00|0x80|0x80|0x00|0x08|0x00|0x00|0x00|0x00|0xFF|0xFF|0xFF|0xFF|

 1    1    0    0    1    1    0    0    0    0    0    0    1    1    1    1

=> 1100110000001111

Sometimes it is useful to go the other way around, and starting from an integer, construct a vector mask where the ith byte is populated when the ith bit is set.

1100110000001111

becomes

|0xFF|0xFF|0x00|0x00|0xFF|0xFF|0x00|0x00|0x00|0x00|0x00|0x00|0xFF|0xFF|0xFF|0xFF|

In other words, construct the right inverse of movemask (why the right inverse? if f denotes this operation, then movemask . f = id, or f is the inverse of movemask on the right; note in general that there is no g with g . movemask = id).

I had wondered how to do this before, and saw a brief description of how to do it in this article, from which I worked out how I’d do it. The article links to a stack overflow post which has other variations. This article aims to condense and explain this information, with a simplification or two thrown in as my own small contribution. These simplifications might come at a performance cost, or they might not - I’ve not measured.

Since this is really about developing techniques and know-how, lets just focus on right inverting pmovmskb.

Right inverting pmovmskb

SSE comparison instructions produce masks across the lanes they compare - for bytewise comparisons, truth-y comparisons produce the mask 0xff, while false-y comparisons produce a mask 0x00. Our example will work by checking the ith bit is set in byte i of an SSE register. To do this, we want two registers, one with the correct byte from the 16-bit mask loaded in each lane (spread), and one register with the correct bit set for that byte (select).

And-ing thes two, and then comparing lanewise that this is equal select produces the mask we want.

select in the above can be gotten with

 _mm_set_epi8(1U << 7, 1U << 6, 1U << 5, 1U << 4,
              1U << 3, 1U << 2, 1U << 1, 1U << 0,
              1U << 7, 1U << 6, 1U << 5, 1U << 4,
              1U << 3, 1U << 2, 1U << 1, 1U << 0);

which is independent of the mask we’re computing.

For spread, if our 16-bit mask is comprised of two bytes AB, we want the register to look like:

AAAAAAAABBBBBBBB

which we can obtain by shuffling bytes in the 16-bit mask with shufflin indices obtained with

 _mm_set_epi8(1, 1, 1, 1, 1, 1, 1, 1,
              0, 0, 0, 0, 0, 0, 0, 0);

once the mask has been moved into an SSE register (_mm_cvtsi32_si128).

Putting this together, we get our inverse:


/*
 * makemask.c
 *
 * Example of right inverting movemask_epi8 using SSE instructions.
 */

#include <emmintrin.h>
#include <tmmintrin.h>

#include <stdint.h>

__m128i makemask_epi8(uint16_t bits)
{
	__m128i lower		= _mm_cvtsi32_si128(bits);
	__m128i spread_bytes	= _mm_set_epi8(1, 1, 1, 1, 1, 1, 1, 1,
				               0, 0, 0, 0, 0, 0, 0, 0);
	__m128i spread		= _mm_shuffle_epi8(lower, spread_bytes);
	__m128i select		= _mm_set_epi8(1U << 7, 1U << 6, 1U << 5, 1U << 4,
				               1U << 3, 1U << 2, 1U << 1, 1U << 0,
				               1U << 7, 1U << 6, 1U << 5, 1U << 4,
				               1U << 3, 1U << 2, 1U << 1, 1U << 0);
	__m128i set		= _mm_and_si128(spread, select);

	return _mm_cmpeq_epi8(set, select);
}

There aren’t that many masks to check, so let’s do a basic check of all possible input values.

#include <assert.h>

int
main()
{
	/* Simple check of all possible values. */
	for (uint16_t ix = 0; ix < (uint16_t)-1; ++ix)
	{
		uint16_t iy = _mm_movemask_epi8(makemask_epi8(ix));
		assert(ix == iy);
	}

	return 0;
}