office-account

Reputation: 9

How granular can multithreaded memory-write access be?

I've read about how NUMA works and that memory is pulled in from RAM through L2 and L1 caches.

I've also read that there are only two ways to share data between threads.

But how granular can the data be for the access to be safe?

For example, if I have two uint8 variables on the stack and I pass pointers to them to two separate threads, can one thread read the first variable while the other writes to the second variable?

How granular does this idea of memory safety have to be? I.e., how much of a gap should there be between two pieces of memory for them to be safely accessible from different threads?

I'm asking because I've also read a bit about how allocators work. What happens if I allocate two contiguous arrays from an allocator, read the first array in thread A, and write to the second array in thread B? Would that cause any problems?

Upvotes: 0

Views: 84

Answers (1)

Jérôme Richard

Reputation: 50488

can one of them read the first variable and the other one write to the second variable?

Yes. Independent variables can be safely accessed from different threads, at least in nearly all languages. Executing the program on a NUMA platform does not change this.

That being said, if the two variables are stored in the same cache line, the latency of the accesses can be much higher. While the cache coherence protocol ensures the safety of the accesses on mainstream architectures, the write will invalidate the copy of the cache line in the L1 cache of the thread reading the other variable, causing its next read to be slower due to a cache miss (this depends on the exact cache coherence protocol used, though). This problem is called false sharing.

Note that cache coherence is still maintained with multiple CPUs on the same node, though the latency is usually significantly higher than on a platform with one mainstream CPU.

But how granular can the data be for the access to be safe?

1 byte is the minimal granularity required by the memory hierarchy. On mainstream platforms, that is thus 1 octet (i.e., 8 bits).

like how much gap should there be between two pieces of memory for them to be accessible from different threads

Regarding performance, it is generally enough to align variables accessed by different threads on a cache-line boundary. AFAIK, on some processors it can be a bit more, like 2 cache lines, due to cache-line prefetching. On mainstream x86 processors, a cache line is 64 bytes.

if i allocate two contiguous arrays on an allocator, read the first array in thread A and write to the second array in thread B? would that cause any problems?

On mainstream platforms, and with mainstream languages, it should only cause performance issues (if any). There are some processors without an explicit/hardware cache coherence mechanism, but they are very unusual, and the runtime/compiler should take care of that (as they need to comply with the target language's specification, which usually does not forbid what you want to do).

Note that allocators tend to align data to 16 bytes on most platforms, including x86-64 processors, for various reasons (mainly for the sake of performance). They also tend to allocate data in thread-local storage so as to improve the scalability of the allocator when independent blocks of data are allocated/freed on different threads (i.e., no block allocated on one thread is freed by another one).

Also, please note that false sharing can be an even bigger performance issue with atomic accesses, since they tend to lock the full cache line on some architectures.

Upvotes: 1
