Packing small sparse arrays for extension data

November 15, 2025

This post documents a small technique I invented (in the sense that I haven’t seen if before, although I wouldn’t be surprised if there is prior art) for packing small sparse arrays into dense arrays. It’s part of the class of techniques using bit operations (on x86_64 for example tzcnt, popcnt) not natively supported in C/C++ and which require use of compiler builtins or intrinsics.

Motivating example

The problem that got me thinking about this involved adding attributes to procedures/global data in my compiler. Most procedures don’t have any special attributes attached to them,

f :: (x : i32) -> i32
x : i32 = 42;

for which defaults should be assumed. Others may have a variety of optional attributes which impact code generation in a variety of ways.

@inline
@section(".text.procs")
f :: (x : i32) -> i32

@align(64)
y : i32 = -42;

Some of these attributes are effectively booleans (@inline), while others have additional data specified (@section(".text.procs"), align(64)).

Procedures/global data are described by fixed size structures, so attributes need to either all be bundled into that structure (while mostly unused), or there needs to be a reference to lookup these attributes when present.

It might be completely reasonable in practice to keep all of this data right there in the structure (depending on the dominant access pattern), but I was contemplating the approaches that could be taken to keep it aside (assuming its access is cold), and not use more storage than really necessary.

The most basic approach might be to have an array of attributes attached to each procedure (or even a linked list if there is merit to each attribute occupying different amounts of space), and then iterating these attributes when they need looking up (there are unlikely to be many attributes). To save hitting this memory for attributes that aren’t there, flags can be set in the core structure to indicate which attributes are present. First check the flags, and then search the array for the correct attribute and data.

Note that using flags has the bonus of being ideal for storing boolean attributes (like @inline).

We’ll be pursuing the flags approach much further - for the purposes of this writeup, assume flags are stored in a 32-bit integer. This can be extended quite naturally.

However, it would be nice to jump straight to a piece of attribute data, without having to search an array for the correct piece. This also has the advantage that the kind of attribute doesn’t need to be stored with the data (in other words, attributes don’t need to be tagged).

One way to do this is to make the array sparse

struct attr_data attrs[32];

Where attrs[i] contains the data corresponding to the ith flag set in the attributes flags. This allows lookup without needing to iterate the array (and so fewer data dependencies), but wastes space on boolean attributes which are never used. It’s also all or nothing; the entire array is needed whenever any attribute data needs to be tracked.

Flag indexing

We continue with the approach of using flags to track presence of attributes.

For the moment, let’s assume all attributes have associated data we need to keep. For a set of flags flags, if some number n of bits are set, it would be ideal if we could allocate an array of n pieces of attribute data, where the ith piece corresponds to the ith set bit in flags.

The key to achieving this is knowing there are instructions which count set bits on most modern architectures.

The following procedure helps us do the job.

#define popcount32  __builtin_popcount

int
flag_idx(uint32_t flags, uint32_t pick)
{
	assert((pick & (pick - 1)) == 0);   /* Pick is a single set bit */
	assert(flags & pick);               /* pick is set in flags */

	uint32_t mask = pick + (pick - 1);
	return popcount32(flags & mask) - 1;
}

flags is our collection of bits indicating which procedures are present, and pick is an individual one of those flags which is set. If pick is the ith set bit, this procedure returns i.

A few assertions provide some examples of what to expect, and make everything more concrete.

{
	uint32_t flags = 1 << 0 | 1 << 3 | 1 << 5;
	assert(flag_idx(flags, 1 << 0) == 0);
	assert(flag_idx(flags, 1 << 3) == 1);
	assert(flag_idx(flags, 1 << 5) == 2);
	assert(flag_idx(flags | 1 << 2, 1 << 5) == 3);
}

The core of flag_idx compiles to a sequence of very basic instructions (gcc will need -mpopcnt to avoid synthesizing the operation popcnt instruction):

	lea     -0x1(%rsi,%rsi,1),%eax
	and     %edi,%eax
	popcnt  %eax,%eax
	sub     $0x1,%eax
	ret

Better still, it’s likely the compiler will inline flag_idx and reduce the above assembly even futher when static values are used.

We can use this to store atttribute data packed in an array. If bit i is set in flags, then the data corresponding to bit i will be found in

attr_data[flag_idx(flags, 1 << i)]

We started with the assumption that all attributes had data associated with them. We can now remove that assumption by masking off only those flags which have data associated. For example if DATA_MASK is a mask of a bits whose attributes have data, we can locate the data by adapting the above to

attr_data[flag_idx(flags & DATA_MASK, 1 << i)]

A worked example of lookup

This example shows a mocked up version of the motivating example. In real life this would be encoded differently – it’s just intended to show the idea. There are various ways the attribute data could be stored; here we use an index into a global array to indicate where to start looking for attributes, if there are any.

We start by defining the data types and tables, as well as the flags we use to mark presence of the different attributes (whether the procedure should be force inlined, whether it has a non-default section, and whether it has a non-default ABI). Only two of these have any associated data which is captured by ATTR_DATA_MASK.

struct global {
	const char  *name;
	uint32_t    attr_start_idx;
	uint32_t    flags;
};

struct attr {
	union {
		const char  *section;
		const char  *abi;
	};
};

#define ATTRS_MAX       64
#define GLOBALS_MAX     32

struct global globals[GLOBALS_MAX] = { 0 };
struct attr   attrs[ATTRS_MAX] = { 0 };
int global_count = 0;
int attr_count = 0;

enum {
	FLAG_FORCE_INLINE,
	FLAG_SECTION,
	FLAG_ABI,
};

static const uint32_t ATTR_DATA_MASK = FLAG_SECTION | FLAG_ABI;

Now lets put some examples in the tables (encoded by hand).

/* Example illustrating the default case - no attributes. */
{
		struct global *g    = &globals[global_count++];
		g->name             = "basic";
}

/* Two attributes. */
{
	struct global *g    = &globals[global_count++];
	g->name             = "interrupt_handler";
	g->flags            = FLAG_ABI|FLAG_SECTION;
	g->attr_start_idx   = attr_count;

	attrs[attr_count++].section     = ".text.interrupt";
	attrs[attr_count++].abi         = "interrupt abi";
}

/* No attributes with data. */
{
	struct global *g    = &globals[global_count++];
	g->name             = "inline";
	g->flags            = FLAG_FORCE_INLINE;
}

/* Single attribute with data. */
{
	struct global *g    = &globals[global_count++];
	g->name             = "custom_abi";
	g->flags            = FLAG_ABI;
	g->attr_start_idx   = attr_count;

	attrs[attr_count++].abi = "custom abi";
}

Finally we loop over the globals, and show how to retrieve data from the attributes (here we just look at the ABI). Note in particular that the “interrupt handler”, and the “custom abi” globals will have the abi data stored at a differed offset from the start index in the attribute array (1 and 0 respectively).

printf("globals: %d, attrs: %d\n", global_count, attr_count);

for (int i = 0; i < global_count; ++i)
{
	const struct global *g = &globals[i];

	printf("%s: ", g->name);
	if (g->flags & FLAG_ABI)
	{
		uint32_t attr_idx = g->attr_start_idx + flag_idx(g->flags & ATTR_DATA_MASK, FLAG_ABI);
		printf("%s\n", attrs[attr_idx].abi);
	}
	else
	{
		printf("default abi\n");
	}
}

The following is the output:

globals: 4, attrs: 3
basic: default abi
interrupt_handler: interrupt abi
inline: default abi
custom_abi: custom abi

Note that we’ve only used three attribute slots – exactly the number of attributes with associated data – and that whatever the number of attributes used for a global, only 64-bits is consumed in the main structure. We only look in the attributes array when there is data actually present there that we need to read.

It’s also worth noting that the “type” of the attribute is determined precisely by the set bits in flags.

Addendum: Packing/unpacking of small sparse arrays with flags

A useful utility for packing/unpacking a sparse array into a packed/unpacked array uses tzcnt on x86_64, accessed through the __builtin_ctz builtin.

#define ctz32   __builtin_ctz

int
indices_by_flags(uint32_t flags, uint8_t index[32])
{
	int count = 0;

	for (uint32_t live = flags; live != 0; live &= live - 1)
	{
		index[count++] = ctz32(live);
	}

	return count;
}

indices_by_flags returns the number of bits set in flags, and the ith slot of the indices array is populated with the ith bit number.

The following example shows how this can be used

{
	uint8_t indices[32];
	int count = 0;
	count = indices_by_flags((1 << 3) | (1 << 4), indices);
	assert(count == 2);
	assert(indices[0] == 3);
	assert(indices[1] == 4);

	/* Creating a packed version of an array. */
	for (int i = 0; i < count; ++i)
	{
		packed[i] = unpacked[indices[i]];
	}

	/* Unpacking a packed array. */
	for (int i = 0; i < count; ++i)
	{
		unpacked[indices[i]] = packed[i];
	}
}

where packed and unpacked are suggestively named variants of some array-of-structures type.