Designing a symbol table

December 25, 2025

Introduction

Compilers are full of strings, and resolving entities by name winds up being a common operation. This in turn entails comparing strings for equality, which, if done naively, is not especially efficient. In order to avoid doing byte-for-byte comparisons between strings, compilers (and interpreters) often use a symbol table, a data structure which maps each string (that the compiler is likely to want to compare) to some canonical copy of the string. This canonical copy can then be represented by a pointer, and string comparison (for these “interned” strings) then reduces to pointer comparison. This mapping of strings to a canonical copy is called interning, and the interned strings are referred to as symbols.

Pointer representations (on most modern hardware) consume 64 bits. Since symbols are referred to in many places in compiler data structures, and 32 bits is more than enough to count all symbols, it’s useful to be able to compress symbols to a 32-bit id. This helps reduce the size of commonly occurring data, and improves cache line utilization. One way to do this is to give each interned symbol an id instead of, or as well as, a stable canonical pointer, and refer to it using this id instead. When the symbol’s raw bytes are needed, they can be obtained from the symbol table.

My compiler evolved from having symbols represented by pointers to later using 32-bit ids. To get things working, this was achieved by keeping a symbol table which, logically, had the following interface:

string *symbol_table_intern(int length, const uint8_t *bytes);

and then layering on top of this another table, mapping from uint32_t to string *. This in particular means that symbol creation requires two table insertions, and obtaining raw bytes also requires a table lookup.

By the time I got to simplifying this system a little, I had a reasonably sized test program from which I could obtain and analyse data, providing a concrete basis for the design of a new table.

Analysing the existing symbol table

My test program interns just shy of 3500 symbols - not a ridiculous amount, but enough to get an idea of the kinds of data a symbol table is dealing with. The existing symbol table is at its core essentially a hash table (designed and specialized for this use case). Symbol interning with the example program triggered several table resizes (meaning interning happens at both low and high table load factors).

Profiling the compiler shows that the compiler spends around 4.5% of its time indexing symbols, of which around 2.5% is spent on the initial interning (and the rest spent mostly on the second layer of indexing).

A new symbol table which deals with symbol ids directly ought to be able to reduce the cost to something comparable to the existing pointer-based interning.

The new table will still be fundamentally a hash table, so we do some analysis of the existing table to try to pick a good data layout. For the most part we want to minimise cache misses, since memory loads are where most of the interning time is spent.

The existing table comprises an array of 32-bit integers alongside an array of pointers to the canonical copies of each string. Each of the 32-bit integers uses a single bit to mark a slot as in use, with the other 31 bits holding 31 bits of the symbol hash. The allocation of the raw string data (i.e. the value stored in the array of pointers) is at least 2-byte aligned, so the remaining bit of the hash is smuggled into the lower bit of this pointer when it’s stored in the pointer array – the table operations OR this in on write and mask it out when recovering the pointer. This means the hash can be completely reconstructed, which helps make table resizing more efficient (hashes don’t need to be recomputed). Knowing whether the slot is empty helps with searching the table – the pointer array only needs to be accessed in the (rare) case of both table slot collision and (31-bit) hash collision. Keeping all this in 32 bits (and not, say, allocating an extra byte for it) helps increase cache-line density.
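As a concrete illustration, a minimal sketch of this bit packing might look like the following (the names here are hypothetical, not the compiler’s actual code):

```c
/* Hypothetical sketch of the old table's bit packing. Each 32-bit
 * entry holds an in-use flag (bit 31) and the low 31 bits of the
 * hash; the remaining hash bit is smuggled into the low bit of the
 * 2-byte aligned canonical pointer. */
#include <assert.h>
#include <stdint.h>

#define SLOT_IN_USE   (UINT32_C(1) << 31)
#define SLOT_HASH     (~SLOT_IN_USE)

static uint32_t
slot_pack(uint32_t hash)
{
	return SLOT_IN_USE | (hash & SLOT_HASH);
}

static uintptr_t
pointer_pack(uintptr_t p, uint32_t hash)
{
	assert((p & 1) == 0);        /* Allocation is at least 2-byte aligned. */
	return p | (hash >> 31);     /* Smuggle the 32nd hash bit. */
}

/* The full hash can be reconstructed from both arrays on resize. */
static uint32_t
hash_unpack(uint32_t slot, uintptr_t packed)
{
	return (slot & SLOT_HASH) | ((uint32_t)(packed & 1) << 31);
}

static uintptr_t
pointer_unpack(uintptr_t packed)
{
	return packed & ~(uintptr_t)1;
}
```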

Interning a symbol involves iterating from the index given by the masked hash, comparing 31-bits of the hash and the present bit for plausibly equal strings already interned (and checking the full string when these match). If an existing string matches what’s being interned, its pointer is returned; otherwise, a new copy of the string is allocated in an arena which provides it with a stable pointer address.

Instrumenting the old table, we can look at the chain lengths (number of iterations until we find a matching string or empty slot), and estimate the expected chain lengths and probability of incurring cache line misses on lookups with different data layouts. This analysis helps us design something which is simple and performs reasonably well.

The table below summarises the samples.

Chain length    Frequency    Probability
0               2798         0.8014
1               324          0.0928
2               123          0.0352
3               98           0.0280
4               24           0.0068
5               34           0.0097
6               15           0.0042
7               16           0.0045
8               11           0.0031
9               12           0.0034
10              11           0.0031
11              6            0.0017
12              6            0.0017
13              2            0.0005
15              2            0.0005
16              1            0.0002
17              1            0.0002
21              1            0.0002
22              4            0.0011
24              1            0.0002
25              1            0.0002

(totalling 3491 samples).

This is a little easier to visualize if we plot the chain length against the probability of that chain length occurring based on our dataset.

Plot showing chain length vs. probability of occurring

Some of the chains get quite long – it would be good to bound the worst case a little better (for example, by using a different probing scheme or a Robin Hood style invariant), but for the moment, we’ll keep to simple linear probing.

Assuming this is a reasonable approximation of the actual probability distribution for chain lengths, we can calculate that the expected chain length for this kind of table is around 0.58 – most lookups will find what they are looking for at the first index they try.
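As a sanity check, the expectation can be computed directly from the sampled frequencies (the data here is transcribed from the table above):

```c
/* Computes the expected chain length from the sampled chain-length
 * frequencies in the table above. */
#include <assert.h>
#include <stddef.h>

static const struct { int length; int freq; } chain_samples[] = {
	{0, 2798}, {1, 324}, {2, 123}, {3, 98},  {4, 24},  {5, 34},
	{6, 15},   {7, 16},  {8, 11},  {9, 12},  {10, 11}, {11, 6},
	{12, 6},   {13, 2},  {15, 2},  {16, 1},  {17, 1},  {21, 1},
	{22, 4},   {24, 1},  {25, 1},
};

static double
expected_chain_length(void)
{
	double weighted = 0.0;
	long   total    = 0;

	for (size_t i = 0; i < sizeof chain_samples / sizeof chain_samples[0]; i++)
	{
		weighted += (double)chain_samples[i].length * chain_samples[i].freq;
		total    += chain_samples[i].freq;
	}

	return weighted / (double)total;   /* Around 0.58 for this data. */
}
```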

Our existing array of hash bits uses 32 bits per entry, from which it’s also possible to calculate the probability that a lookup iterates outside the cache line it initially indexes into. With 16 slots per cache line (64/4), and assuming our masked hash lands with uniform probability in any of these slots, independently of the chain length distribution, we need to compute

P(outside first cache line) = P(first slot) * P(chain length >= 16) + ... + P(last slot) * P(chain length >= 1)

This turns out to be around a 0.035 probability for our 32-bit entries.

One key consideration with hash-table design is how to store the logical keys and values. Storing key-value pairs leads to more cache misses as values grow in size, suggesting keeping values in a separate array with correlated indices (which will likely incur a cache miss, but better than incurring several while searching for a key). However, in our upcoming design, our values will be the 32-bit symbol id. Supposing we were to store the hash and symbol id as a 64-bit composite, we can ask how the probability of missing the first cache line changes when iterating over 64-bit values (8 slots per cache line). It turns out that here our probability is 0.065, which is almost twice as likely as the 32-bit case, but still reasonably small. This suggests keeping small values alongside keys won’t radically increase the probability of cache misses in a way that impacts performance.
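Both of these figures can be reproduced from the sampled distribution; here is a sketch of the calculation, again using the data transcribed from the table above:

```c
/* Estimates the probability that a lookup leaves its first cache
 * line, given S slots per line and a uniformly distributed start
 * slot: P = (1/S) * sum over s in 1..S of P(chain length >= s). */
#include <assert.h>
#include <stddef.h>

static const struct { int length; int freq; } chain_samples[] = {
	{0, 2798}, {1, 324}, {2, 123}, {3, 98},  {4, 24},  {5, 34},
	{6, 15},   {7, 16},  {8, 11},  {9, 12},  {10, 11}, {11, 6},
	{12, 6},   {13, 2},  {15, 2},  {16, 1},  {17, 1},  {21, 1},
	{22, 4},   {24, 1},  {25, 1},
};

static double
p_chain_at_least(int k)
{
	long above = 0, total = 0;

	for (size_t i = 0; i < sizeof chain_samples / sizeof chain_samples[0]; i++)
	{
		total += chain_samples[i].freq;
		if (chain_samples[i].length >= k)
			above += chain_samples[i].freq;
	}

	return (double)above / (double)total;
}

static double
p_outside_first_line(int slots_per_line)
{
	double p = 0.0;

	/* A lookup starting s slots before the end of the line escapes
	 * it when the chain length is at least s. */
	for (int s = 1; s <= slots_per_line; s++)
		p += p_chain_at_least(s);

	return p / slots_per_line;
}
```

With 16 slots per line this comes out around 0.035, and with 8 slots around 0.065, matching the figures above.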

It’s worth mentioning that profiling shows interning is mostly called in the parsing stage, as part of resolving references. Given most procedures reference local variables that they themselves define, there is some temporal locality to symbol interning. Once a local variable has been interned, it’s quite likely to be interned again (finding the existing entry), meaning many cache misses on initial interning will be amortized over future lookups. A small probability of hitting extra cache lines is therefore probably not very impactful to overall throughput.

A new table

Our new table will have a slightly different API. Something of the form:

typedef uint32_t symbol_id;

symbol_id symbol_table_intern(int length, const uint8_t *bytes);

Given we get to assign the numbers for each symbol id, it would be ideal if we could make the symbol string lookup very straightforward by using the id as an array index (and avoid the silly table lookup we had previously). This meshes with the tried and true id allocation scheme of just incrementing an integer.

struct byteslice
{
	size_t length;
	const uint8_t *bytes;
};

struct symbol_table
{
	struct byteslice *symbol;
	...
};

static inline struct byteslice
symbol_table_symbol_bytes(struct symbol_table *st, symbol_id s)
{
	return st->symbol[s];
}

For the actual symbol table structure, we want to maintain a map between the string contents and the symbol id. In the original table having the hash to hand was useful for both resizing the table, and for doing a fast reject of interned strings which won’t match. If we need to check a string more precisely by checking the length and bytes, having the symbol id available allows us to find the string in the symbol array above by direct indexing.

Our analysis of the original table gave us data suggesting that using 64-bit locator keys in the table shouldn’t impact cache miss rate too significantly vs. 32-bit locator keys. To that end our table will be backed by an array of uint64_t elements consisting of a symbol id (in the lower 32 bits) and hash (in the upper 32 bits).

struct symbol_table
{
	struct byteslice *symbol;

	uint32_t    mask;
	uint32_t    used;
	uint64_t    *locator;
};
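Packing and unpacking these locator entries is straightforward; as a sketch (the helper names are mine, for illustration):

```c
#include <stdint.h>

/* Locator entry layout: hash in the upper 32 bits, symbol id in
 * the lower 32 bits. */
static uint64_t
locator_pack(uint32_t hash, uint32_t id)
{
	return ((uint64_t)hash << 32) | id;
}

static uint32_t
locator_hash(uint64_t entry)
{
	return (uint32_t)(entry >> 32);
}

static uint32_t
locator_id(uint64_t entry)
{
	return (uint32_t)entry;
}
```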

Instead of reserving a bit in every entry to mark the presence of a symbol (which avoided looking at the symbol pointer in a different array), we reserve symbol_id 0, and pre-define it as the empty string. Any string of length 0 can be matched to this, and we can avoid recording it in the locator array at all; the table is used for deduplication, not for looking up the raw bytes. This means we can use a zero symbol as an indicator that a slot is empty in the locator array.

We use the used field to allocate symbol ids. When the table is initialized, this will be set to 1 to account for the null symbol, despite this not being present in the hashing part of the table. It also tracks the number of entries in the table (off-by-one to account for the null symbol), so is used to detect the need to resize.

Resizing is always to power-of-two sized arrays, which lets us mask hashes to valid indices for the locator array. This capacity is always an upper bound for the number of interned strings, so we can also resize the symbol array on table resize to avoid needing to check capacity when adding new symbols. The capacity is stored in the form of mask, which is one less than the capacity, and is the form in which the capacity is used on the fast path.

The complete structure also contains an arena for allocating the actual symbol data, but this is a minor detail.

The following shows how interning works with this structure. Some of the functions are not defined, but it should be clear enough what they do.

symbol_id
symbol_table_intern(struct symbol_table *st, int length, const uint8_t *bytes)
{
	if (length == 0)
	{
		return 0;
	}

	assert((float)(st->mask + 1) * LOAD_FACTOR >= st->used);

	uint32_t hash   = strhash(length, bytes);
	uint32_t i      = hash & st->mask;

	for (;;)
	{
		uint32_t symbol = (uint32_t)st->locator[i];

		if (symbol == 0)
		{
			break; /* Slot i is empty. */
		}

		const struct byteslice *raw = &st->symbol[symbol];

		if (hash == (uint32_t)(st->locator[i] >> 32)
		    && raw->length == (size_t)length
		    && !memcmp(raw->bytes, bytes, length))
		{
			return symbol;  /* Symbol is already interned. */
		}

		i = (i + 1) & st->mask;
	}

	/* At this point, slot i is empty, and we didn't find the string;
	 * allocate a new one. */

	st->locator[i] = (uint64_t)hash << 32;
	st->locator[i] |= st->used;

	struct byteslice *raw = &st->symbol[st->used];

	raw->length     = length;
	raw->bytes      = intern_bytes(st, length, bytes);  /* Copy into arena. */

	symbol_id key = st->used++;

	if ((float)(st->mask + 1) * LOAD_FACTOR < st->used)
	{
		resize(st);
	}

	return key;
}
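For completeness, the undefined resize might look something like the following sketch (restating the structure so the sketch is self-contained). The allocation details here are simplified – plain calloc/realloc rather than the real allocator, with no error handling – so treat it as illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

struct byteslice
{
	size_t length;
	const uint8_t *bytes;
};

struct symbol_table
{
	struct byteslice *symbol;

	uint32_t    mask;
	uint32_t    used;
	uint64_t    *locator;
};

static void
resize(struct symbol_table *st)
{
	uint32_t  old_cap = st->mask + 1;
	uint32_t  cap     = old_cap * 2;
	uint64_t *old     = st->locator;

	st->mask    = cap - 1;
	st->locator = calloc(cap, sizeof *st->locator);

	/* Capacity bounds the symbol count, so grow the symbol array
	 * in step; adding symbols then never needs a capacity check. */
	st->symbol = realloc(st->symbol, cap * sizeof *st->symbol);

	for (uint32_t i = 0; i < old_cap; i++)
	{
		if ((uint32_t)old[i] == 0)
		{
			continue;   /* Empty slot: id 0 is the null symbol. */
		}

		/* The stored hash means the string bytes never need
		 * rehashing; just re-mask to the new capacity. */
		uint32_t j = (uint32_t)(old[i] >> 32) & st->mask;

		while ((uint32_t)st->locator[j] != 0)
		{
			j = (j + 1) & st->mask;
		}

		st->locator[j] = old[i];
	}

	free(old);
}
```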

End results

The new symbol table makes the additional index obsolete. It reduces the cost of the old interning system – which took around 4.5% of the runtime of the compiler – to around the cost of the symbol interning step alone (it’s now about 2.5% of the runtime of the compiler). Indeed, profiling numbers suggest it might be slightly faster (an ever so slightly smaller percentage of an overall decreased runtime).

Additionally, converting symbol ids to raw bytes is now just an array index rather than a hash table lookup, so this is much simpler (and faster too). This didn’t really show up prominently in the profile of the program (it’s low cost compared to a lot of the other, fatter phases), but indexing this array should translate to a single instruction, so it can’t really be made much cheaper (and avoids some of the worst cases of hash tables, where longer chains form).

Doing the analysis has also suggested some other ways symbol interning might be improved.

Future directions

I’d like to benchmark the performance of both tables (the original without the second index) to get a better idea of which actually performs better. A benchmark would also allow monitoring hardware events to see if there is any measurable difference in e.g. cache misses.

In terms of making a better table, there are a few strikingly obvious things that can be investigated.

The first is trying to reduce the maximum chain length on a lookup. There are various ways to do this, but some obvious ones are trying a different probing strategy (for example, using unused bits of the hash to offset to a new index in the table), or imposing a new invariant such as in Robin Hood hash tables (keys with shorter chain lengths get shunted along in the table, meaning chain length is shared more evenly amongst keys).
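As a sketch of the Robin Hood idea applied to the locator layout above (illustrative only, not the compiler’s code): each entry’s probe distance – how far it sits from the slot its hash maps to – is tracked during insertion, and an incumbent closer to home than the entry being inserted gets displaced and carried onwards.

```c
#include <stdint.h>

/* Illustrative Robin Hood insertion over the 64-bit hash<<32 | id
 * locator entries (id 0 still marks an empty slot). */
static uint32_t
probe_distance(uint64_t entry, uint32_t slot, uint32_t mask)
{
	uint32_t home = (uint32_t)(entry >> 32) & mask;
	return (slot - home) & mask;
}

static void
robin_hood_insert(uint64_t *locator, uint32_t mask, uint64_t entry)
{
	uint32_t i    = (uint32_t)(entry >> 32) & mask;
	uint32_t dist = 0;

	for (;;)
	{
		if ((uint32_t)locator[i] == 0)
		{
			locator[i] = entry;   /* Empty slot: claim it. */
			return;
		}

		/* An incumbent closer to its home slot than we are to
		 * ours is "rich": take its slot and carry it onwards,
		 * evening out chain lengths across keys. */
		if (probe_distance(locator[i], i, mask) < dist)
		{
			uint64_t displaced = locator[i];

			locator[i] = entry;
			entry      = displaced;
			dist       = probe_distance(entry, i, mask);
		}

		i = (i + 1) & mask;
		dist++;
	}
}
```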

The second is to try something more like a Swiss table – keeping a parallel array of small per-slot tag bytes so that longer chains can be probed several slots at a time.

This is a decent simple implementation for now, which is exactly what I want at the core of my compiler. Further improvements don’t need to happen immediately!