Memory models and lock-free programming

July 19, 2020

A terse account of memory models and lock-free programming. Further references are linked throughout. Currently a work-in-progress.

Let's start by describing the high-level concerns around atomic variables, as they appear in relatively high-level systems programming languages (function names are taken from the C11 API in stdatomic.h). We'll examine why these concerns arise at the hardware level later, as well as various ways of thinking about them, before coming back to the APIs exposed in higher-level languages with examples of how to use them.

The three pillars of lock-free programming

It's useful to break thinking about atomic variables into three parts: properties of the variables themselves, properties of the operations on those variables, and the relative ordering of those operations with respect to other operations in code.

One thing worth noting: basic atomicity (i.e. non-torn reads and writes) is a prerequisite both for read-modify-write operations to be well-defined and for memory ordering to be well-defined (in particular, for there to be a single well-defined point at which a write becomes visible in the cache system).

The following mentions some specifics of hardware, which are discussed in more detail later. For this section it is sufficient to bear in mind that

  • the cache system is the source of truth for the value stored at a particular memory address (more precisely, if an address is in cache, then the cache knows the value at that address, otherwise the value at that address is found in memory)
  • the store buffer buffers writes from the CPU to the cache hierarchy, introducing latency between a write and the time at which that write is visible to other threads
  • the store buffer may flush writes in a different order to that in which they occurred (it may appear to reorder writes)

Atomic variables

An atomic variable on its own only guarantees that reads and writes will not be torn. If a thread reads an atomic variable, then the value that it reads will have been written by a thread at some point in time. A 32-bit machine may for instance not be able to write a 64-bit variable atomically (but may instead decompose it as two writes, one for the lower bits and one for the upper).
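For illustration, a minimal hypothetical sketch of the difference (write_both is not from any API):

#include <stdatomic.h>
#include <stdint.h>

uint64_t plain = 0;            // a plain 64-bit store may tear on a 32-bit machine
_Atomic(uint64_t) shared = 0;  // loads and stores may not be torn

void
write_both(void)
{
    plain = UINT64_C(0x1122334455667788);                // possibly two 32-bit stores
    atomic_store(&shared, UINT64_C(0x1122334455667788)); // never observed half-written
}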

Note that an atomic variable is just a position in memory, so at some level these guarantees have to be supported by the hardware and instruction set. x86_64 for instance guarantees certain writes (e.g. 8-byte aligned 64-bit writes) are atomic, while it does not guarantee others (e.g. writes that cross a cache-line boundary).

Note that this doesn't say anything about when e.g. a write becomes the coherent state of the cache hierarchy - that is a property of the operation, possibly influenced by the memory ordering constraints (in particular, sequential consistency).

Operations on atomics

A store to an atomic variable need not be any different from a store to any other variable. A store may sit in the store buffer for a while before being propagated and committed to the cache hierarchy. (The default store memory order, memory_order_seq_cst, guarantees that the store is propagated all the way to cache, because the memory order requires it, but weaker memory orderings need not. Even then, this need not be implemented as part of the store instruction itself: on x86_64, gcc version 9 and below implemented it by following the store with an mfence to allow the store buffer to flush, while gcc version 10 implements sequentially consistent stores using lock xchg. lock xchg is allegedly faster, even though it waits for the store buffer to flush.)
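As a sketch of what this looks like at the source level (publish is a hypothetical function; the codegen notes repeat the gcc behaviour described above):

#include <stdatomic.h>
#include <stdbool.h>

_Atomic(bool) ready = false;

void
publish(void)
{
    // The default, sequentially consistent store. On x86_64 this has been
    // lowered either as a mov followed by mfence (gcc 9 and below) or as
    // a single lock xchg (gcc 10); a relaxed store would be a plain mov.
    atomic_store(&ready, true);
}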

Atomic variables do, however, come with special read-modify-write operations, which modify the value committed to cache. The simplest of these is a plain exchange of values (atomic_exchange, atomic_exchange_explicit). These operations need to be backed by hardware-level instructions; for atomic_exchange, x86_64 compilers make use of the xchg instruction (implicitly a lock xchg), which locks the memory bus (or, in practice, more likely a single cache-line) for the duration of the operation to prevent a mid-operation write.
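A minimal sketch of the exchange operation (set_flag is a hypothetical function):

#include <stdatomic.h>
#include <stdbool.h>

_Atomic(bool) flag = false;

bool
set_flag(void)
{
    // Atomically writes true and returns the value that was replaced;
    // on x86_64 this compiles down to a (locked) xchg.
    return atomic_exchange(&flag, true);
}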

The fact that these operations affect the value in cache is a property of the read-modify-write (RMW) instructions themselves, and this need not be true of vanilla loads and stores. Moreover, RMW operations are still vulnerable to reordering, at the very least at compile time.

Memory ordering

Compilers will reorder operations to try and optimize code. Processors will execute operations out-of-order, and the store buffer is not guaranteed to flush stores to cache in the same order in which the stores occurred.

This means that, without further specification, instructions on atomics may appear from the perspective of other threads to be executed in a different order to the instruction stream. This is often undesirable - for instance, we want to guarantee operations in a critical section inside a lock stay inside the critical section, and aren't, say, reordered after the release of the lock.

Memory orderings specify how the operation on an atomic variable appears relative to other operations in the same thread, from the perspective of another thread or the cache system. It's important to realise that this is all that they do. Nonetheless, this is where most of the subtlety of using atomics lies.

Hardware level

We start by describing typical hardware, to understand how a CPU interacts with memory, and to help explain why some writes may appear reordered from another thread's perspective.

Cache coherency

Overview

Caches already include protocols (MESI being the prototypical one, MESIF for multi-socket systems) for ensuring that caches for different cores are kept consistent.

Caches for distinct cores snoop on each other's memory accesses. In order for a cache to accept a write to a cache-line, it negotiates exclusive access with the other caches. If another core needs to read from a dirty cache-line, the caches must synchronize. Any data requested via the cache system will be synchronized and coherent, so e.g. a read from one CPU will agree with a read from another CPU once the write which wrote that value has propagated to cache.

Caches respond to bus events, but won't necessarily do so immediately; instead each maintains an invalidation queue, processing the events in this queue when it has time/is no longer busy. This means that the caches converge on being synchronized over time, but not necessarily immediately. From the CPU's point of view this manifests as added latency when reading from cache, since the cache will always provide coherent data.

Multiple sockets and socket interconnect

The above is true across sockets also, but there is an increased cost to synchronizing across sockets. Where latency in the cache system needs to be kept low, CPUs should stick to cache-lines in their local memory.

Store buffers (and load buffers)

Since the cache is relatively slow (compared to what can be done in registers), processors have store buffers to buffer stores to the L1 cache, and avoid paying the cost of waiting for the cache to accept a write. Buffered stores may not propagate to the cache in execution order, and hence the other cores may see stores in a different order to what the program order suggests. This, along with the fact that the CPU may also be executing instructions out-of-order motivates the need for explicit fences or memory ordering constraints.
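The classic store-buffer litmus test illustrates this. In the sketch below (hypothetical code, with relaxed atomics so nothing adds ordering), both r1 and r2 may end up 0: each store may still be sitting in its core's store buffer when the other thread performs its load.

#include <stdatomic.h>

_Atomic(int) x = 0;
_Atomic(int) y = 0;

int r1, r2;

int
thread_a(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return 0;
}

int
thread_b(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return 0;
}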

Memory ordering constraints may require some flushing of the store buffers (technically, waiting for the buffers to empty), and hence may require state changes to the cache coherency mechanism to e.g. negotiate exclusive access to a cache-line to perform a store.

The degree of constraint here varies. Sometimes we only care in what order memory operations become visible to other cores, but not precisely when. Other times we want to enforce a strict, consistent global ordering on modifications of a variable.

Load buffers seem to be invisible from the application perspective. There is an lfence instruction on x86_64 which only completes when all prior loads from memory have completed. This is only really needed for "weakly-ordered memory types", for example, a non-temporal load from memory which circumvents the cache-hierarchy.

Write combining buffers

When a write from a CPU needs to be pushed to L1 cache, but a cache miss occurs, then a processor may use a cache-line sized buffer to buffer the write to L2 cache. These buffers also combine subsequent writes to the same cache-line.

This article explains and demonstrates them well. Note that these buffers sit between the store buffer and the L2 cache. They also seem to be used for non-temporal writes which avoid the cache hierarchy and write directly to memory.

Write combining buffers are limited in number, and shared by hyper-threaded cores.

Memory models and memory ordering

This section outlines the memory orderings available in C and C-like languages (C++, Rust, etc.), and how they give different memory models. Each of these orderings needs to be respected by the compiler, and backed by the hardware at some level. To provide this, an instruction set may give blanket guarantees on ordinary reads and writes, or provide particular instructions with stronger guarantees. For example, a load using mov on x86_64 is guaranteed to obey acquire semantics with respect to surrounding instructions; likewise, a store using mov on x86_64 is guaranteed to obey release semantics with respect to surrounding instructions.

Remember that the various memory orders only impact how operations can be reordered around atomic operations. Allowing more opportunities for reordering opens up more opportunities for optimization, but also more opportunity for bugs.

Acquire and release semantics (memory_order_acquire and memory_order_release)

Acquire semantics - No loads or stores on a thread/CPU can be reordered before a load operation with acquire.

Release semantics - No loads or stores on a thread/CPU can be reordered after a store operation with release.

Reordering here is with respect to visibility in the cache subsystem for the core (i.e. visible in memory, since the cache subsystem is authoritative for the state of memory).

The example below shows how these memory orderings can be used to synchronize data across threads.

The x86_64 architecture guarantees acquire semantics on mov from memory, and release semantics on mov to memory. Put another way, all loads with mov will appear in the same order in the cache system as they occur in the machine code, and stores with mov will occur in the same order as they appear in the machine code.

To confuse matters, atomic_thread_fence provides slightly stronger versions of the acquire and release guarantees (the semantics above are defined for load and store operations, which fences are not). See the section on fences below.

Example

// Compile with e.g. gcc FILE -pthread -o PROGNAME
#include <inttypes.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

#define CHECK(cond) do { if (!(cond)) { fprintf(stderr, "check failed: " #cond "\n"); abort(); }} while (0)

_Atomic(bool) guard = false;
uint64_t payload = 0;

int
writer(void *arg)
{
    payload = 100;
    atomic_store_explicit(&guard, true, memory_order_release);
    return 0;
}

int
main(int argc, char *argv[])
{
    thrd_t child;
    CHECK(thrd_create(&child, writer, NULL) == thrd_success);

    while (!atomic_load_explicit(&guard, memory_order_acquire)) {}
    printf("Received %ld\n", payload);

    CHECK(thrd_join(child, NULL) == thrd_success);
    return 0;
}

Here the store to guard with memory_order_release prevents the prior write to payload from being reordered after the write to guard. Similarly, the read of guard with memory_order_acquire, prevents the subsequent read of payload from being reordered before the read of guard.

In sum, this means that at the point the main thread reads guard as true, the thread executing writer has made the new value of payload visible, and since the read of payload in main must occur after that, it will read the new value.

Consume (memory_order_consume)

Warning: Very few (if any) compilers implement memory_order_consume without simply promoting it to memory_order_acquire, so it's probably safer to stick with memory_order_acquire. Moreover, C++ (and hence possibly C) looks like it might redefine what memory_order_consume means in future. On architectures where a data dependency suffices, acquire semantics are too strong; if the performance really is needed, it's probably sensible to fall back to assembly (consulting the instruction set to understand how dependencies are carried).

Data loaded with a consume memory ordering comes with slightly weaker guarantees than data read with acquire semantics (note that x86_64 provides acquire semantics on ordinary loads as part of its memory model, so consume semantics don't add anything there).

Consume semantics - no load on a thread/CPU that depends on the value loaded with consume can be reordered before the load with consume.

The notion of "depending on" also needs to be made precise. The following is more or less verbatim from the C++ standard library documentation.

Carries a dependency/depends on - A statement B depends on a statement A (A carries a dependency into B) if and only if A is sequenced before B and one of

  • the value of A is used as an operand of B, except when B is a call to kill_dependency or A is the left operand of &&, ||, ?: or , operators
  • A writes to a scalar object M, B reads from M
  • A carries a dependency into an evaluation X, and X carries a dependency into B.

(The first condition makes it unclear how dependencies are carried across function boundaries - note that C++ provides a carries_dependency attribute that can be used to help manage this. The examples for the kill_dependency macro suggest compilers are conservative across function call boundaries, and without the carries_dependency attribute will emit an acquire fence.)
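A sketch of the macro's use (read_index is a hypothetical function):

#include <stdatomic.h>

int
read_index(_Atomic(int) *p)
{
    int i = atomic_load_explicit(p, memory_order_consume);
    // kill_dependency ends the dependency chain: the compiler no longer
    // needs to extend consume ordering to uses of the result.
    return kill_dependency(i) + 1;
}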

A consume operation ensures that if the load sees a store-with-release from another thread, then all stores preceding the store-with-release will be visible to any operations which depend on the value read with the load-consume.

The following example is archetypal.

Example

// Compile with e.g. gcc FILE -pthread -o PROGNAME
#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

#define CHECK(cond) do { if (!(cond)) { fprintf(stderr, "check failed: " #cond "\n"); abort(); }} while (0)

_Atomic(uint64_t*) guard = NULL;
uint64_t value = 0;

int
writer(void *arg)
{
    value = 100;
    atomic_store_explicit(&guard, &value, memory_order_release);
    return 0;
}

int
main(int argc, char *argv[])
{
    thrd_t child;
    CHECK(thrd_create(&child, writer, NULL) == thrd_success);

    uint64_t *p = NULL;
    while ((p = atomic_load_explicit(&guard, memory_order_consume)) == NULL) {}
    uint64_t v = *p;

    printf("Received %ld\n", v);

    CHECK(thrd_join(child, NULL) == thrd_success);
    return 0;
}

In main, p is loaded with consume semantics, and will be loaded as non-null when the store to guard in writer propagates to this thread.

The dereference of p into the value v in main depends on the value loaded into p, so it won't be reordered before the atomic_load_explicit of p; hence the dereference of p must see all memory operations that happened before the store to guard in writer. In particular, it must see the write to value.

Running this code should reliably output Received 100.

Note that the constraint here is strictly weaker than an acquire memory ordering. On x86_64 this has no real effect, but on platforms with weaker memory models (ARM for instance), it can remove the need for relatively expensive synchronization instructions. No compiler as of yet does this, so at best memory_order_consume expresses intent.

Acquire-and-release (memory_order_acq_rel)

For a read-modify-write operation it can be desirable to have acquire semantics for the read and release semantics for the write. memory_order_acq_rel provides this.

The simplest example here is a simple locking structure. See the spinlock example below.

Sequential consistency (memory_order_seq_cst)

Sequential consistency ensures a single, globally consistent order amongst all operations on atomic variables performed with sequential consistency, in addition to providing acquire and release semantics.

On x86_64 this is stronger than a standard mov will provide, because the thread doing the store also needs to guarantee that the write is globally visible before continuing. This is done by either following the mov by an mfence or using the locked instruction xchg.

Typically this is less performant than simple acquire and release, since it requires the operation to be committed to cache in order to make the operation globally visible. It is however, the safest and the default.
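As a sketch of what the global order buys, consider the store-buffer litmus test from the store buffer section, upgraded to sequentially consistent operations (hypothetical code):

#include <stdatomic.h>

_Atomic(int) x = 0;
_Atomic(int) y = 0;

int r1, r2;

int
thread_a(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return 0;
}

int
thread_b(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return 0;
}

All four operations now appear in a single global order, and whichever store comes first in that order is observed by the other thread's load, so r1 == 0 && r2 == 0 is impossible. Acquire and release semantics alone do not forbid that outcome.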

Relaxed orderings (memory_order_relaxed)

Relaxed ordering provides no guarantees on the reordering of an operation on an atomic with other operations around it; only the atomicity of the operation itself is guaranteed.
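A typical use is a statistics counter, where each increment must be atomic (no lost updates) but its ordering relative to surrounding operations doesn't matter. A minimal sketch (record_event is a hypothetical function):

#include <stdatomic.h>
#include <stdint.h>

_Atomic(uint64_t) events = 0;

void
record_event(void)
{
    // The increment is atomic, but implies no ordering with neighbouring
    // loads and stores.
    atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
}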

Fences vs. atomic operations with explicit memory orders

Acquire semantics (thread fence version) - No loads or stores on a thread/CPU can be reordered before any load before a fence with acquire.

Release semantics (thread fence version) - No loads or stores on a thread/CPU can be reordered after any store after a fence with release.

These guarantees are slightly stronger than those provided on a single operation, since they constrain all operations preceding/following the fence, not a single annotated load/store.
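As a sketch, the acquire/release example from earlier can be rewritten against fences, with the operations on the atomic itself relaxed (writer_fenced and reader_fenced are hypothetical names):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic(bool) guard;
extern uint64_t payload;

void
writer_fenced(void)
{
    payload = 100;
    // Nothing before the fence may be reordered after any store that
    // follows it; in particular not after the store to guard.
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&guard, true, memory_order_relaxed);
}

bool
reader_fenced(void)
{
    if (!atomic_load_explicit(&guard, memory_order_relaxed))
        return false;
    // Nothing after the fence may be reordered before any load that
    // precedes it; in particular not before the load of guard.
    atomic_thread_fence(memory_order_acquire);
    return payload == 100;
}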

Preshing's article on acquire and release fences explains this well.

Read-Modify-Write (RMW) operations

This section contains a few additional notes on read-modify-write operations.

False sharing

Atomic read-modify-write operations to two distinct variables sharing a cache-line may lead to unexpected contention.

For example, on x86_64 locked instructions will often lock the cache-line of the variable (when the operation writes to an address contained entirely within a cache-line - locked instructions across cache-line boundaries are more expensive anyway). This means that if multiple atomics are packed into a single cache-line, and concurrent read-modify-write operations are executed on them, then concurrency will be limited by the ability to lock the cache-line. In other words, it's not advantageous to pack multiple atomic variables upon which you expect to execute read-modify-write operations into an array, for example.

The problem of accidentally causing write/locking contention between multiple shared variables sharing a single cache-line is known as false sharing.
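A common mitigation is to give each such atomic its own cache-line via alignment. A sketch, assuming 64-byte cache-lines (padded_counter is a hypothetical type):

#include <stdatomic.h>
#include <stdint.h>

// One counter per thread. Aligning each element to a cache-line ensures
// concurrent read-modify-writes on different elements don't contend for
// the same line; an array of bare atomics would pack several per line.
typedef struct
{
    _Alignas(64) _Atomic(uint64_t) value;
} padded_counter;

static padded_counter counters[8];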

Simple examples

Simple lock-free single-producer-single-consumer queue

The following is a single-producer-single-consumer queue, with a small driver program that punts integers backwards and forwards, incrementing and decrementing them. It runs indefinitely.

head and tail are free-running counters, masked (via CLAMP) into an index in the array. Each counter has a single writer: the producer owns tail and the consumer owns head. Each side reads the other side's counter with acquire and publishes its own progress with release, as annotated in the code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

#define CHECK(cond) do { if (!(cond)) { fprintf(stderr, "check failed: " #cond "\n"); abort(); }} while (0)

#define QUEUE_LENGTH 4096
#define QUEUE_MASK   (QUEUE_LENGTH - 1)
#define CLAMP(offset) ((offset) & QUEUE_MASK)

typedef uint64_t item;

typedef struct
{
  item queue[QUEUE_LENGTH];
  _Atomic(size_t) head;
  _Atomic(size_t) tail;
} spsc_queue;

typedef enum
{
  SPSC_OK,
  SPSC_FULL,
  SPSC_EMPTY
} spsc_status;

spsc_status
spsc_try_queue(spsc_queue *const queue, item item)
{
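  // head is written by the consumer with release; acquire here ensures the
  // consumer has finished reading a slot before we overwrite it. tail is
  // only written by this (single) producer, so a relaxed load suffices.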
  size_t head = atomic_load_explicit(&queue->head, memory_order_acquire);
  size_t tail = atomic_load_explicit(&queue->tail, memory_order_relaxed);

  // If head and tail are different but mask to the same index, then
  // the queue is full
  if (CLAMP(head) == CLAMP(tail) && head != tail)
  {
    return SPSC_FULL;
  }

  queue->queue[CLAMP(tail)] = item;
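  // The release store publishes the item: the write to the slot above
  // cannot become visible after the new tail.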
  atomic_store_explicit(&queue->tail, tail + 1, memory_order_release);

  return SPSC_OK;
}

spsc_status
spsc_try_unqueue(spsc_queue *const queue, item *item)
{
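  // tail is written by the producer with release; acquire here ensures the
  // published item is visible before we read the slot. head is only
  // written by this (single) consumer, so a relaxed load suffices.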
  size_t tail = atomic_load_explicit(&queue->tail, memory_order_acquire);
  size_t head = atomic_load_explicit(&queue->head, memory_order_relaxed);

  if (head == tail)
  {
    return SPSC_EMPTY;
  }

  *item = queue->queue[CLAMP(head)];
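  // The release store publishes our progress: the read of the slot above
  // completes before the producer can observe the new head and reuse the
  // slot.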
  atomic_store_explicit(&queue->head, head + 1, memory_order_release);

  return SPSC_OK;
}


spsc_queue out = {0};
spsc_queue in = {0};

int
processor(void *args)
{
  item current = 0;

  while (true)
  {
    while (spsc_try_unqueue(&out, &current) != SPSC_OK) {}
    ++current;
    while (spsc_try_queue(&in, current) != SPSC_OK) {}
  }
}

int
main(int argc, char *argv[])
{

  for (item ix = 0; ix < QUEUE_LENGTH; ++ix)
  {
    CHECK(spsc_try_queue(&out, ix) == SPSC_OK);
  }

  thrd_t worker;
  CHECK(thrd_create(&worker, processor, NULL) == thrd_success);

  item current = 0;
  while (true)
  {
    while (spsc_try_unqueue(&in, &current) != SPSC_OK) {}
    --current;
    while (spsc_try_queue(&out, current) != SPSC_OK) {}
  }

  CHECK(thrd_join(worker, NULL) == thrd_success);

  return 0;
}

Simple lock (spinlock)

TODO: review, especially the last part on exchange.

A simple locking structure is interesting to examine. We discuss variations below.

// Compile with e.g. gcc FILE -pthread -o PROGNAME
#include <inttypes.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

#define CHECK(cond) do { if (!(cond)) { fprintf(stderr, "check failed: " #cond "\n"); abort(); }} while (0)

#define ITERATIONS 10000000

typedef enum
{
  SPINLOCK_BUSY,
  SPINLOCK_TAKEN
} spinlock_status;

typedef struct
{
  _Atomic(bool) locked;
} spinlock;

static inline spinlock_status
spinlock_try_lock(spinlock *const lock)
{
  const bool busy = atomic_exchange_explicit(&lock->locked, true, memory_order_acq_rel);
  return busy ? SPINLOCK_BUSY : SPINLOCK_TAKEN;
}

static inline void
spinlock_unlock(spinlock *const lock)
{
  atomic_exchange_explicit(&lock->locked, false, memory_order_acq_rel);
  return;
}

static uint64_t counter = 0;
static spinlock lock = {0};

int
writer(void *args)
{
  for (size_t ix = 0; ix < ITERATIONS; ++ix)
  {
    while (spinlock_try_lock(&lock) != SPINLOCK_TAKEN) {}
    counter++;
    spinlock_unlock(&lock);
  }

  return 0;
}

int
main(int argc, char *argv[])
{
    thrd_t w1, w2;
    CHECK(thrd_create(&w1, writer, NULL) == thrd_success);
    CHECK(thrd_create(&w2, writer, NULL) == thrd_success);
    CHECK(thrd_join(w1, NULL) == thrd_success);
    CHECK(thrd_join(w2, NULL) == thrd_success);

    printf("Counter is %ld\n", counter);

    return 0;
}

Let's begin by examining the atomic_exchange_explicit in the spinlock_try_lock function, which tries to set the locked flag, returning a status code indicating whether it succeeded.

Exchange is a read-modify-write operation, which makes sense for taking the lock (we want to read and write the flag in a single step). We don't want load or store instructions inside the lock's critical section to be reordered before the exchange where we take the lock, so the exchange must be at least memory_order_acquire. In fact, this is sufficient for correctness. In principle, we don't mind stores prior to the exchange being reordered after the exchange, but this increases the amount of work executed inside the critical section.

Usually, however, we want to keep the critical section as compact as possible, so it is desirable for the exchange to also be a release, hence memory_order_acq_rel.

Now let's take a look at the unlock operation, spinlock_unlock. Similarly to the locking procedure, we don't want loads and stores inside the critical section to be reordered after the exchange, hence we need at minimum memory_order_release. As above, we also don't want loads and stores after the critical section being reordered into the critical section, to keep the critical section's work minimal, hence we also add an acquire, giving memory_order_acq_rel.

An interesting optimization can be made in the locking function. The key observation is that we use a locked RMW operation both to test whether we can take the lock and to take it, and it's possible this operation contends with the thread trying to release the lock. A simple optimization is to check the lock variable first with a plain load, only committing to the more expensive RMW operation if the load observes that the lock was free (note that this doesn't mean it will still be free at the point of the exchange, so we need to check again). This is known as "test and test-and-set" (closely related to the double-checked locking pattern). More explicitly

static inline spinlock_status
spinlock_try_lock(spinlock *const lock)
{
  if (atomic_load_explicit(&lock->locked, memory_order_relaxed))
  {
    return SPINLOCK_BUSY;
  }

  const bool busy = atomic_exchange_explicit(&lock->locked, true, memory_order_acq_rel);
  return busy ? SPINLOCK_BUSY : SPINLOCK_TAKEN;
}

This seems to give an approximately 30% speedup on my system in the above (essentially maximally contended) example.

Note that the first load can be relaxed - we only use it to return early and avoid the more expensive synchronization operation.

Can unlock just store-release?

In principle we might not need to use a read-modify-write instruction to unlock the lock, if we always access counter through the lock (note that we don't do this in main). The minimum we seem to need is an atomic_store_explicit with memory_order_release. I suspect this may not lead to a predictable lock, but I need to investigate more. There are some potential advantages to using the read-modify-write operation though.
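For reference, the store-release variant in question would look something like this (a sketch; spinlock_unlock_store is a hypothetical name):

static inline void
spinlock_unlock_store(spinlock *const lock)
{
  // Release: loads and stores inside the critical section cannot be
  // reordered after this store, but the store itself may sit in the
  // store buffer for a while.
  atomic_store_explicit(&lock->locked, false, memory_order_release);
}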

First, we may wish to use the lock only while multiple threads are accessing the protected data, and access it directly when we know only a single thread is running (as we do with counter in main above, after joining the worker threads). We want to guarantee that such single-threaded access sees the up-to-date protected data.

Let's examine what would happen in the example above if the spinlock_unlock procedure used an atomic_store_explicit with memory_order_release. In particular, we want to know what counter looks like in main at the printf line. Note that there is no guarantee that the final write to counter will be visible from main.

If in main we were to take the lock before reading counter we would read the most up-to-date value, because the lock would read the locked atomic with acquire semantics, and since the locked atomic was written with release semantics, the last write of counter must be visible to main.

But in fact we can do without taking the lock, because the read-modify-write operation used to release the lock guarantees that the cache system sees locked set to false at the end of the RMW instruction, and hence, because of the release semantics on that operation, the write to counter must also be visible in the cache subsystem. The cache is authoritative, so other threads must subsequently read the most up-to-date value of counter.

In practice, joining the threads makes the value visible anyway: thrd_join synchronizes with the completion of the joined thread, so everything the workers wrote is visible in main after the joins.

Secondly, it's possible (measurement required!) that using the read-modify-write operation provides a fairer lock.

The above example completes faster when unlock uses atomic_store (about half the time with -O2). This is likely because the example is a tight loop around taking and releasing the lock. Probably the writer is able to read the false value straight from its own store buffer, before it's visible to the other thread at all, while also holding the cache-line lock. Adding some basic logging suggests this might be the case, and that the exchange operation ends up distributing the lock more fairly than the store.

A few more examples would be needed to get a good picture of the tradeoffs - the above test code is essentially completely serial.

Hazard-pointer pair (single-producer-single-consumer)

TODO: check memory orderings in code, and explain properly.

A hazard pointer is a structure for managing pointers across threads. It allows a writer thread to set a pointer, and determine when it is safe to free the old value (i.e. the consumer is no longer using it).

The example below does require the producer and consumer to adhere to some basic rules to ensure memory isn't leaked or freed before it's safe. In particular, the consumer thread should always make sure the latest value obtained from hazard_deref is the one which is used, and all older values obtained from it should be considered stale. Furthermore, a hazard_load should only be considered complete once hazard_loaded returns true.

// Compile with e.g. gcc FILE -pthread -o PROGNAME
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

#define CHECK(cond) do { if (!(cond)) { fprintf(stderr, "check failed: " #cond "\n"); abort(); }} while (0)

#define ITERATIONS 1000000

typedef struct
{
  _Atomic(void*) current;
  _Atomic(void*) live;
} hazard_ptr;

void*
hazard_load(hazard_ptr* hazard, void* new)
{
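    // Only the producer writes current, so the relaxed load of the old
    // value is fine. The release store publishes the new object's
    // contents along with the pointer itself.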
    void *old = atomic_load_explicit(&hazard->current, memory_order_relaxed);
    atomic_store_explicit(&hazard->current, new, memory_order_release);
    return old;
}

bool
hazard_loaded(const hazard_ptr* hazard, const void* new)
{
    // Acquire pairs with the release store to live in hazard_deref: once
    // we observe live == new, the consumer's uses of the previous pointer
    // happen before this point, so the caller may safely free it.
    void* live = atomic_load_explicit(&hazard->live, memory_order_acquire);
    return live == new;
}

void*
hazard_deref(hazard_ptr* hazard)
{
    void* current = atomic_load_explicit(&hazard->current, memory_order_consume);
    // Release: the consumer's prior uses of the old pointer must be
    // visible before the producer can observe the new value of live.
    atomic_store_explicit(&hazard->live, current, memory_order_release);
    return current;
}

int
producer(void* arg)
{
    hazard_ptr* hazard = arg;

    for (size_t ix = 0; ix < ITERATIONS; ++ix)
    {
        size_t *new = calloc(1, sizeof(size_t));
        CHECK(new != NULL);
        *new = ix;

        size_t *old = hazard_load(hazard, new);
        while (!hazard_loaded(hazard, new)) {}
        free(old);
    }

    // Finally, load NULL to indicate the producer is done
    size_t *old = hazard_load(hazard, NULL);
    while (!hazard_loaded(hazard, NULL)) {}
    free(old);

    return 0;
}

int
main(int argc, char *argv[])
{
    hazard_ptr h = {
      .current = calloc(1, sizeof(size_t)),
      .live = NULL
    };

    CHECK(h.current != NULL);

    thrd_t feed;
    CHECK(thrd_create(&feed, producer, &h) == thrd_success);

    for (const size_t* v = hazard_deref(&h); v != NULL; v = hazard_deref(&h))
    {
        printf("Current: %zu\n", *v);
    }

    CHECK(thrd_join(feed, NULL) == thrd_success);
    return 0;
}

hazard_ptr contains two atomics. current is the one which is used for the bulk of synchronization, while live is used as a flag by the producer to gauge when an updated pointer has propagated to the consumer thread.

Note that in hazard_deref the memory_order_consume constraint prevents the store to live from being reordered before the read from current (the value stored depends on the value loaded). The store to live is a release, pairing with the acquire load in hazard_loaded: once the producer observes live == new, everything the consumer did with the previous pointer happens before the producer's free, which is what makes freeing the old value safe.

Similarly, in hazard_load, the load from current can be kept relaxed: current is only ever written by the producer thread, which is the only caller of hazard_load, so the load is guaranteed to see that thread's own latest store.

It's expected that current will point to some data which the consumer wants to use in some way. This means the writes to that data must be visible to other threads before the write to current. This motivates the release/consume combination of order guards in hazard_load and hazard_deref.