Register shuffling

March 9, 2025

Compilers are full of cute little data structure and algorithm problems. The following is a neat example of the kind of problem that emerges in a number of places, for example, shuffling values around in registers during code generation. It’s also a demonstration of a neat kind of abstraction that got distilled from solving a few related problems. These problems had solutions which inevitably ended up being buggy in various similar ways, for similar reasons. This little tool helped make these solutions much more robust.

This also turned out to be a nice demonstration of small bitsets. We don’t use the full capability of this representation, but it ends up being convenient to be able to copy sets as primitive values.

The shuffling problem

Abstractly, the problem boils down to describing a function (mathematically speaking) on a small set of integers in terms of two primitive operations: move (copy) and swap.

For illustrations sake (and as it turns out, it’s a useful way to test), we’re going to suppose we have 4 registers, modelled as an array of length 4, each holding their index as an initial value.

int slots[4] = { 0, 1, 2, 3 };

Given any other array of length 4 containing digits 0-3, we want to produce a sequence of move or swap commands to produce the given array from the starting position.

For example, if the target is

int want[4] = { 1, 0, 2, 2 };

We might produce the commands

SWAP 0, 1
MOVE 2 -> 3

A more complicated example (to keep track on), might be

int want[4] = { 3, 0, 1, 2 };

For which the sequence

SWAP 0, 3
SWAP 1, 3
SWAP 2, 3

works. This is slightly harder to see, and shows that we must keep track on where values end up, and not overwrite anything which must be preserved for later writes.

An API

We introduce a structure struct shuf, which is used to keep track of shuffling state. Zeroed is initialized.

struct shuf s = { 0 };

We allow the desired state to be declared through a procedural interface. For example, if we were targeting the above

int want[4] = { 3, 0, 1, 2 };

we might encode this as a series of shuf_add(&s, i, want[i]):

shuf_add(&s, 0, 3);
shuf_add(&s, 1, 0);
shuf_add(&s, 2, 1);
shuf_add(&s, 3, 2);

From this we will produce a stream of actions.

It’s useful to represent these actions abstractly, since this lends itself to interpreting these actions differently, depending on the context. Integer (general purpose) registers must be handled differently to floating point registers, for example. In our case here, it’s useful to be able to interpret these actions directly on an array, in order to test the code.

These actions are a small tagged union:

struct shuf_action
{
	enum { SHUF_MOVE, SHUF_SWAP } action;
	int from;
	int to;
};

The core procedure is shuf_next, which determines the next action to take, if any. This is best illustrated by an example, in this case showing how it might be interpreted on our slots array above.

struct shuf_action a;
while (shuf_next(&s, &a))
{
	switch (a.action)
	{
		case SHUF_MOVE:
		{
			slots[a.to] = slots[a.from];
		}
		break;

		case SHUF_SWAP:
		{
			int tmp = slots[a.to];
			slots[a.to] = slots[a.from];
			slots[a.from] = tmp;
		}
		break;
	}
}

The sequence of actions produced, once executed, should produce the desired state. Any registers (slots) not mentioned in shuf_add as destinations should be left untouched (i.e. the default state is shuf_add(i, i) for each i).

In this example we used arrays of length 4 for demonstration. There are only 256 possible end results for this small system, so we can easily exhaustively test this example.

In practice, for x86_64, we’ll be working with values between 0 and 15 (i.e. 16 items) - there are 16 general purpose registers and 16 floating point registers. The same code extends naturally to any number of items smaller than the number of bits in a machine primitive. Beyond that, the approach works, but it needs encoding differently.

Approach

Your first instinct might be to find a register, and try and move a value into it. It becomes clear quite quickly that what you need to track is where the value in a given register must ultimately be placed. This turns out to be all you need.

#define SHUF_COUNT  16

typedef uint16_t shuf_mask;

struct shuf
{
	shuf_mask   forward[SHUF_COUNT];
};

For example, if s has type struct shuf, and and bit d is set in s->forward[r], then the value currently in slot r must eventually be written to slot d.

slot_add sets this up. The interface doesn’t police dst and src strongly - add assertions to check this as needed.

void
shuf_add(struct shuf *s, int dst, int src)
{
	if (dst == src)
	{
		return;
	}

	s->forward[src] |= 1 << dst;
}

For shuf_next it’s useful to have a helper which finds the next set bit in a shuf_mask - we use a builtin available in gcc/clang.

#define shuf_mask_next(m)   __builtin_ctz(m)

This represents a single instruction (tzcnt) on x86_64. A portable version (albeit a touch slower) can also be written for other compilers.

The below is the complete implementation of shuf_next. The breakdown and analysis follows the listing.

int
shuf_next(struct shuf *s, struct shuf_action *a)
{
	int i = 0;
	shuf_mask visited = 0;

	while (i < SHUF_COUNT)
	{
		if (s->forward[i] == 0)
		{
			++i;
			continue;
		}

		int next = shuf_mask_next(s->forward[i]);

		if (s->forward[next] == 0)
		{
			a->action   = SHUF_MOVE;
			a->from     = i;
			a->to       = next;

			s->forward[i] &= ~(1 << next);

			return 1;
		}

		if (visited & (1 << next))
		{
			a->action   = SHUF_SWAP;
			a->from     = i;
			a->to       = next;

			shuf_mask tmp    = s->forward[next] & ~(1 << i);
			s->forward[next] = s->forward[i] & ~(1 << next);
			s->forward[i]    = tmp;

			return 1;
		}

		visited |= 1 << i;
		i = next;
	}

	return 0;
}

The core of shuf_next is a while loop which terminates once all work has been completed (s->forward[i] is zero for all i). The first if block inside the while loop serves to advance to the first slot which contains a value which needs to be propagated into another slot.

This could probably be encoded more efficiently, by e.g. keeping a mask of slots with work to do, and jumping straight to one which we can try and progress, or simply storing the index of the next slot to progress, but I decided to keep this simple and just redo the loop each iteration - tuning this probably isn’t too difficult.

Progressing beyond this branch means slot i needs its value copied into another slot - we pick a destination slot with next and try and resolve this propagation.

The first branch after defining next handles the case where the value in next is no longer needed, and we can safely copy the value from slot i into next. We clear the bit for next in s->forward[i] to mark this work done, and emit the action for handling.

If the value in slot next is still needed, we drill down and redo the loop with i set to next, and try and make progress on next.

If we can make progress with just simple copies, this all works out. If not, since there are only finitely many slots, this process must eventually loop, so visited is used to mark slots we’ve already visited so that we can detect this. If we detect a loop, we emit a swap command for slot i and next. This moves the value in slot i into the desired position in next, so we can clear the relevant bit (s->forward[i] & ~(1 << next)). We must also swap the bitmasks for these slots since the we swapped the values in them.

Once a value is in the correct slot, it will never be moved, so we can always clear bit j for slot j (see tmp for example in the swapping of bitmasks). Maintaining that bit i is never set in slot i ensures that i != next in every iteration of the loop. The alternative is to handle the i == next case explicitly, which also works.

Because SHUF_COUNT is finite, and in each loop i != next, we must eventually hit one of the if branches below the definition of next. Each of these clears at least one bit, and there are only finitely many bits in the s->forward array, so eventually all bits must be cleared, and so repeated calls to shuf_next must eventually return zero to denote all work done.

Application

This is useful for a range of problems in code generation. For example,