Maja Piechotka

Reputation: 7216

Concurrent access to elements in the same cacheline in non-shared cache on x86-64

Assume I have the following code:

int x[200];

void thread1() {
  for(int i = 0; i < 100; i++)
    x[i*2] = 1;      // writes only the even indices
}

void thread2() {
  for(int i = 0; i < 100; i++)
    x[i*2 + 1] = 1;  // writes only the odd indices
}

Is this code correct under the x86-64 memory model (from what I understand, it is), assuming the page was configured with the default write cache policy on Linux? And what is the performance impact of such code (from what I understand - none)?

PS. As for performance - I am mostly interested in Sandy Bridge.

EDIT: As for expectations - I want to write to aligned locations from different threads. After both threads finish and a barrier is passed, I expect x to contain {1,1,1,...} rather than {0,1,0,1,...} or {1,0,1,0,...}. A minimal runnable version of what I mean is sketched below.
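Here is a self-contained sketch of that scenario using pthreads; the function names and the final check are mine (not part of the original snippet), and the two pthread_join calls stand in for the barrier:

#include <pthread.h>
#include <stdio.h>

int x[200];

void *writer_even(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++)
        x[i * 2] = 1;        /* even indices only */
    return NULL;
}

void *writer_odd(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++)
        x[i * 2 + 1] = 1;    /* odd indices only */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer_even, NULL);
    pthread_create(&t2, NULL, writer_odd, NULL);
    pthread_join(t1, NULL);  /* the joins act as the barrier */
    pthread_join(t2, NULL);

    for (int i = 0; i < 200; i++)
        if (x[i] != 1) {
            printf("unexpected: x[%d] == %d\n", i, x[i]);
            return 1;
        }
    printf("all elements are 1\n");
    return 0;
}

Compile with gcc -O2 -pthread; the check at the end should always print "all elements are 1", since the two threads never store to the same location.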

Upvotes: 1

Views: 328

Answers (1)

Maja Piechotka

Reputation: 7216

If I understand correctly, the writes will eventually propagate via snoop requests. Sandy Bridge uses QuickPath between cores, so the snooping does not hit an FSB but goes over a much faster interconnect. As it is not based on cache-invalidation-on-write, it should be 'fairly' quick, although I was not able to find the overhead of the conflict resolution (it is probably lower than an L3 write).

Source

EDIT: According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, a clean hit has an impact of 43 cycles and a dirty hit an impact of 60 cycles (compared with the normal overhead of 4 cycles for L1, 12 for L2, and 26-31 for L3).
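That overhead can be seen directly by comparing two threads that increment counters sharing a cache line against counters padded 64 bytes apart. The sketch below is my own rough benchmark, not from the manual; the layout, alignment, iteration count, and timing code are all assumptions:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Two counters forced into one cache line vs. pushed 72 bytes apart. */
static struct { _Alignas(64) volatile long a; volatile long b; } same_line;
static struct { volatile long a; char pad[64]; volatile long b; } padded;

static void *bump(void *p) {
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++)
        (*c)++;                  /* repeated read-modify-write */
    return NULL;
}

static double run(volatile long *a, volatile long *b) {
    pthread_t t1, t2;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, bump, (void *)a);
    pthread_create(&t2, NULL, bump, (void *)b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void) {
    printf("same cache line: %.2f s\n", run(&same_line.a, &same_line.b));
    printf("padded:          %.2f s\n", run(&padded.a, &padded.b));
    return 0;
}

On a multi-core machine the same-line case should run noticeably slower, since every store forces the line to migrate between the cores' private caches, which is exactly the per-hit cost quoted above.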

Upvotes: 1
