Svetoslav Ilkov Enchev

Reputation: 381

Mixing memory orderings memory_order_seq_cst and memory_order_relaxed in an SPMC queue?

I'm implementing a chunk-oriented shared ring-buffer for single producer, multiple consumers.

The buffer keeps track of the current iteration (i.e. wrap count) and the last written chunk index. When writing, the producer first marks the chunk's iteration field as dirty, fills in the other fields (offset/size/etc), then marks the iteration field as non-dirty with the same iteration number as the whole buffer. After that, it increments the last written chunk index appropriately. When reading, the consumers would read the chunk descriptor, verify the iteration is the expected one and the chunk is not being written to, read the data, and verify again.

struct chunk_descriptor_t {
  atomic_uint_fast32_t iteration;
  atomic_uint_fast32_t chunk_offset;
  atomic_uint_fast32_t chunk_size;
  atomic_uint_fast32_t user_data;
};

struct shared_buf_descriptor_t {
  atomic_uint_fast32_t total_data_size;
  atomic_uint_fast32_t num_chunks;
  atomic_uint_fast32_t iteration;
  atomic_uint_fast32_t last_written_chunk_idx;
  struct chunk_descriptor_t chunks_descriptors[];
};

typedef enum e_iter_flags {
  ITER_VALUE_MASK = (1 << 30) - 1,
  ITER_PRODUCER_DIRTY = 1 << 30,
  ITER_PRODUCER_RUNNING = 1 << 31
} iter_flags_t;

The producer would write the next chunk like this:

static void add_chunk(shared_buf_producer_t producer, const char *data,
                      uint32_t chunk_size, uint32_t user_data) {
  uint_fast32_t chunk_idx = (producer->last_written_chunk_idx + 1) %
                            producer->shbuf->header.num_chunks;
  struct chunk_descriptor_t *chunk =
      &producer->shbuf->descriptor->chunks_descriptors[chunk_idx];
  // 1. mark the chunk dirty. This must happen first.
  chunk->iteration =
      ITER_PRODUCER_DIRTY | (producer->iteration & ITER_VALUE_MASK);
  
  // 2. update the chunk fields and write data. This must happen after 1 and before 3.
  // The order inside doesn't matter.
  chunk->chunk_size = chunk_size;
  chunk->user_data = user_data;
  chunk->chunk_offset = producer->write_offset;
  memcpy(producer->shbuf->header.data + producer->write_offset, data,
         chunk_size);
  
  // 3. mark the chunk clean. This must happen after 2 and before 4. 
  chunk->iteration = producer->iteration & ITER_VALUE_MASK;
  
  // 4. update the last written index. This must happen last.
  producer->shbuf->descriptor->last_written_chunk_idx = chunk_idx;
  
  // producer bookkeeping
  producer->write_offset += chunk_size;
  producer->last_written_chunk_idx = chunk_idx;
}

So I need to ensure the memory ordering. Currently, memory_order_seq_cst is used for all stores. However, this would also order the stores in (2), where the order inside isn't important, as long as (2) as a whole comes after (1) and before (3).

I wonder if I can relax the ordering in (2). And if so, should those be memory_order_relaxed stores, or should the other fields perhaps not be atomic at all?
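For concreteness, here is roughly what the relaxed variant of add_chunk would look like (only a sketch of what I'm considering, not something I claim is correct; names match the structs above):

static void add_chunk_relaxed(shared_buf_producer_t producer, const char *data,
                              uint32_t chunk_size, uint32_t user_data) {
  uint_fast32_t chunk_idx = (producer->last_written_chunk_idx + 1) %
                            producer->shbuf->header.num_chunks;
  struct chunk_descriptor_t *chunk =
      &producer->shbuf->descriptor->chunks_descriptors[chunk_idx];
  // 1. mark the chunk dirty (kept seq_cst)
  atomic_store_explicit(
      &chunk->iteration,
      ITER_PRODUCER_DIRTY | (producer->iteration & ITER_VALUE_MASK),
      memory_order_seq_cst);
  // 2. fields whose relative order doesn't matter - relaxed?
  atomic_store_explicit(&chunk->chunk_size, chunk_size, memory_order_relaxed);
  atomic_store_explicit(&chunk->user_data, user_data, memory_order_relaxed);
  atomic_store_explicit(&chunk->chunk_offset, producer->write_offset,
                        memory_order_relaxed);
  memcpy(producer->shbuf->header.data + producer->write_offset, data,
         chunk_size);
  // 3. mark the chunk clean (kept seq_cst)
  atomic_store_explicit(&chunk->iteration,
                        producer->iteration & ITER_VALUE_MASK,
                        memory_order_seq_cst);
  // 4. publish the last written index (kept seq_cst)
  atomic_store_explicit(&producer->shbuf->descriptor->last_written_chunk_idx,
                        chunk_idx, memory_order_seq_cst);
  // producer bookkeeping
  producer->write_offset += chunk_size;
  producer->last_written_chunk_idx = chunk_idx;
}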

Upvotes: 2

Views: 316

Answers (2)

Svetoslav Ilkov Enchev

Reputation: 381

After further research, I believe the correct implementation should look like this:

struct chunk_descriptor_t {
  atomic_uint_fast32_t iteration;
  uint_fast32_t offset;
  uint_fast32_t size;
  uint_fast32_t user_data;
};
void put_chunk(shared_buf_producer_t producer, uint_fast32_t iteration,
               uint_fast32_t chunk_idx, uint_fast32_t chunk_size,
               uint_fast32_t user_data, const char *data) {
  struct chunk_descriptor_t *chunk =
      &producer->shbuf->descriptor->chunks_descriptors[chunk_idx];
  uint_fast32_t chunk_offset = producer->write_offset;
  // 1. Mark the chunk dirty. This happens first. All previous stores would
  // "happen before" this one.
  atomic_store_explicit(&chunk->iteration,
                        ITER_PRODUCER_DIRTY | (iteration & ITER_VALUE_MASK),
                        memory_order_release);
  // 2. Make sure stores to non-atomic data are not reordered before the chunk
  // is marked dirty
  atomic_thread_fence(memory_order_release);
  chunk->offset = chunk_offset;
  chunk->size = chunk_size;
  chunk->user_data = user_data;
  memcpy(producer->shbuf->header.data + chunk_offset, data, chunk_size);

  // 3. Mark the chunk as clean. Ensure non-atomic data is not reordered after
  // the chunk is marked clean.
  atomic_store_explicit(&chunk->iteration, iteration & ITER_VALUE_MASK,
                        memory_order_release);
  // 4. Update the last written index. This happens last.
  atomic_store_explicit(&producer->shbuf->descriptor->last_written_chunk_idx,
                        chunk_idx, memory_order_release);

  // bookkeeping
  producer->write_offset = (producer->write_offset + chunk_size) %
                           producer->shbuf->header.total_data_size;
  producer->last_written_chunk_idx = chunk_idx;
}
int try_read_chunk(shared_buf_consumer_t consumer,
                   uint_fast32_t expected_iteration, uint_fast32_t chunk_idx,
                   uint_fast32_t *user_data, autogrow_buf_t out_buf) {
  struct chunk_descriptor_t *chunk =
      &consumer->shbuf->descriptor->chunks_descriptors[chunk_idx];

  // 1. Check if the chunk is clean. If it is, we can read it.
  // Make sure subsequent loads of non-atomic data are not reordered before.
  uint_fast32_t chunk_iteration =
      atomic_load_explicit(&chunk->iteration, memory_order_acquire);
  if (chunk_iteration != expected_iteration) {
    return chunk_iteration;
  }
  // 2. Read the chunk size and offset.
  uint_fast32_t chunk_offset = chunk->offset;
  uint_fast32_t chunk_size = chunk->size;
  uint_fast32_t chunk_user_data = chunk->user_data;

  // 3. Make sure previous loads of non-atomic data are not reordered after.
  atomic_thread_fence(memory_order_acquire);

  // 4. Check if the chunk is still clean. If it is, the size and offsets are
  // consistent and can be used to read the data.
  chunk_iteration =
      atomic_load_explicit(&chunk->iteration, memory_order_acquire);
  if (chunk_iteration != expected_iteration) {
    return chunk_iteration;
  }
  // 5. Read the chunk data
  autogrow_buf_add_data(out_buf, consumer->shbuf->header.data + chunk_offset,
                        chunk_size);
  // 6. Make sure previous loads of non-atomic data are not reordered after.
  atomic_thread_fence(memory_order_acquire);
  // 7. Check if the chunk is still clean. If it is, we're done.
  chunk_iteration =
      atomic_load_explicit(&chunk->iteration, memory_order_acquire);
  if (chunk_iteration != expected_iteration) {
    return chunk_iteration;
  }
  *user_data = chunk_user_data;
  return chunk_iteration;
}

First, "single total modification order" is not needed, as access will be done using acquire/release semantics from relevant threads.

Second, besides the "iteration" field of the chunk descriptor, which is used as a kind of lock, the other fields (and shbuf->data) don't need to be atomic and can be accessed using normal loads/stores.

However, the non-atomic stores in the producer must be guaranteed not to happen before the chunk is marked dirty; and in the consumer, checking whether the chunk is dirty must be guaranteed to happen before the non-atomic loads. This is where the atomic_thread_fence(acquire/release) calls come into play: they provide exactly those guarantees (and were missing from the question).
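Distilled down to just the fence pairing, the per-chunk protocol is a seqlock. A minimal, self-contained sketch of the same pattern (the single payload field stands in for offset/size/user_data/data):

#include <stdatomic.h>
#include <stdint.h>

#define DIRTY_FLAG (UINT32_C(1) << 30)

struct seq_chunk_t {
  atomic_uint_fast32_t iteration; /* acts as the per-chunk "lock word" */
  uint_fast32_t payload;          /* non-atomic data */
};

static void seq_write(struct seq_chunk_t *c, uint_fast32_t iter,
                      uint_fast32_t value) {
  atomic_store_explicit(&c->iteration, iter | DIRTY_FLAG,
                        memory_order_release); /* 1. mark dirty */
  atomic_thread_fence(memory_order_release);   /* 2. plain stores below can't move above the dirty mark */
  c->payload = value;                          /*    non-atomic store(s) */
  atomic_store_explicit(&c->iteration, iter,
                        memory_order_release); /* 3. mark clean; orders the payload store before it */
}

static int seq_try_read(struct seq_chunk_t *c, uint_fast32_t iter,
                        uint_fast32_t *out) {
  if (atomic_load_explicit(&c->iteration, memory_order_acquire) != iter)
    return 0;                                  /* dirty or wrong iteration */
  uint_fast32_t v = c->payload;                /* non-atomic load(s) */
  atomic_thread_fence(memory_order_acquire);   /* plain loads above can't move below the re-check */
  if (atomic_load_explicit(&c->iteration, memory_order_acquire) != iter)
    return 0;                                  /* writer interfered; caller retries */
  *out = v;
  return 1;
}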

Note: On strongly-ordered architectures like x86-64, no barrier instructions are emitted for these. One thread only does stores (so the needed ordering is StoreStore), and the other threads only do loads (LoadLoad). The fences make sure that the compiler doesn't reorder the non-atomic stores or loads. On ARM, depending on the architecture level, atomic_load/store_explicit(acquire/release) would issue ldar/stlr (lda/stl on 32-bit ARM). The atomic_thread_fence(acquire/release) should issue dmb ishld/ishst (although it seems that GCC is conservative and uses dmb ish for both acquire and release).

Upvotes: 0

Gold

Reputation: 136

I strongly recommend using an existing MPMC queue, because your question is really difficult to answer without writing a book, and even if you get an answer you may discover its culprits months, if not years, later. Lock-free is hard; if you're a developer like me and have never spent a month debugging a simple lock-free linked list on a 40-core machine, you cannot imagine.

From the beginning:

struct chunk_descriptor_t {
  atomic_uint_fast32_t iteration;
  atomic_uint_fast32_t chunk_offset;
  atomic_uint_fast32_t chunk_size;
  atomic_uint_fast32_t user_data;
};

It's OK to declare the fields atomic, but that's only the beginning, because here you introduce false sharing due to cache-line granularity. If you have at least 2 cores accessing the same data, they'll be serialized by the L3 cache on x86 and you'll lose any benefit of going multicore.
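A common mitigation (a sketch only, assuming 64-byte cache lines; not part of the original code) is to align each descriptor to its own cache line so two cores never contend on the same line:

#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

struct chunk_descriptor_t {
  /* aligning the first member to 64 bytes makes the whole struct 64-byte
     aligned and padded, so each array element lands on its own cache line */
  alignas(64) atomic_uint_fast32_t iteration;
  atomic_uint_fast32_t chunk_offset;
  atomic_uint_fast32_t chunk_size;
  atomic_uint_fast32_t user_data;
};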

Second, the unit that needs to be atomic is the entire struct chunk_descriptor_t, not just one particular field. It should be protected by a mutex whenever you change something within it, for the sake of temporal coherency. That means assigning like this is fundamentally wrong, because these 3 lines should be atomic all together (a mutex-based sketch follows the snippet):

chunk->chunk_size = chunk_size;
chunk->user_data  = user_data;
chunk->chunk_offset = producer->write_offset;
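For illustration only (not from the original code): a minimal mutex-protected version of that update, assuming a pthread_mutex_t is added alongside the descriptor fields:

#include <pthread.h>
#include <stdint.h>

/* Hypothetical descriptor with a mutex guarding the whole struct. */
struct locked_chunk_descriptor_t {
  pthread_mutex_t lock;
  uint_fast32_t chunk_offset;
  uint_fast32_t chunk_size;
  uint_fast32_t user_data;
};

static void update_descriptor(struct locked_chunk_descriptor_t *chunk,
                              uint_fast32_t chunk_offset,
                              uint_fast32_t chunk_size,
                              uint_fast32_t user_data) {
  pthread_mutex_lock(&chunk->lock);   /* all three fields change as one unit */
  chunk->chunk_size = chunk_size;
  chunk->user_data = user_data;
  chunk->chunk_offset = chunk_offset;
  pthread_mutex_unlock(&chunk->lock);
}

Readers would take the same lock before reading the fields, which is exactly the temporal coherency a lone atomic field cannot give you.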

On x86 the trick is that memory accesses are (mostly) done in order implicitly, giving you a false feeling of security; sadly, that turns out to be wrong on weakly-ordered platforms like ARM.

So the answer is: no, you cannot relax anything, provided that you read/write the fields in the same order everywhere in your program; but that's really a non-issue, since this way of doing lock-free is fundamentally wrong.

Upvotes: 1
