Reputation: 12943
I need to create a sort of a shared object (for whatever reason). It's not limited to the single-threaded usage. Generally in such cases interlocked operations are the way to go (such as InterlockedIncrement
and InterlockedDecrement
on Win32).
Whereas the object reference counting should work correctly in any scenario, I'd like to optimize it for single-threaded usage. Interlocked operations are very much heavier than the standard arithmetic ones. From my measurements an interlocked operation (issuing full memory barrier) takes about 40 CPU cycles on my "typical" CPU, whereas standard arithmetic ones are below any measurement accuracy (thanks to CPU cache).
There's a similar technique when it comes to the memory allocation. There are heap implementations, such as "TCMalloc", that consist of a centralized memory partitioning mechanism, guarded by the appropriate synchronization objects, plus per-thread caching. In the most common scenario the memory allocated/freed on per-thread cache, which doesn't involve any interlocked operations at all, plus CPU cache is utilized with high probability.
Hence I though about a possibility to do something similar for reference-supporting objects. Any ideas how to achieve this? Raw ideas are also welcome.
It's ok in my scenario to delay the actual object destruction for some time, if this improves the performance.
Upvotes: 2
Views: 99
Reputation: 129454
I wouldn't bother. I just ran this benchmark:
#include<stdio.h>
#define SIZE 1000000
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
void print_avg(const char *str, const int *diff, int size)
{
int i;
long sum = 0;
int max = -1, min = 10000;
for(i = 0; i < size; i++)
{
int t = diff[i];
sum += t;
if (t > max) max = t;
if (t < min) min = t;
}
printf("%s average =%f clocks, max =%d, min =%d\n", str, (double)sum / size, max, min);
}
int main()
{
unsigned long long a, b;
int diff[SIZE];
int value = 0;
int i;
for(i = 0; i < SIZE; i++)
{
a = rdtsc();
__sync_fetch_and_add(&value, 2);
b = rdtsc();
diff[i] = (int)(b - a);
}
print_avg("Locked", diff, SIZE);
for(i = 0; i < SIZE; i++)
{
a = rdtsc();
value += 2;
b = rdtsc();
diff[i] = (int)(b - a);
}
print_avg("Not locked", diff, SIZE);
return 0;
}
Compiled with gcc -O2 it gives the following results:
Locked average =105.672402 clocks, max =38756, min =86
Not locked average =80.540389 clocks, max =23433, min =73
I ran it several times, and the results are very similar each time. Please ignore the big numbers for max - that's when the processor takes an interrupt or something - it came from some code I wrote for a different purpose, and I just recycled that for this test. This small difference should apply on all modern processors (Intel iCore and AMD Athlon64 and those sort of generations)
Unless for some reason your compiler doesn't inline InterlockedIncrement, adding an if-statement on your code will most likely cost at least 5 cycles, so you are saving at most 10 cycles. Hopefully you are doing something else than incrementing and decrementing the reference counters.
Edit: Adding a memory barrier doesn't make a big difference either - about 10 cycles.
Admittedly, if I add ten adds in the second loop, it makes about five clock cycles to the each loop (so half a clock per add, on average), where as the locked add takes about 20 clocks per add. It's still not worth adding an if-statement, in my opinion. But if you fancy adding "if (nr_threads == 1) a = a + 1; else a = __sync_fetch_and_add(a, 1);", [or whatever it takes to do what you need] I'm not going to stop you. But make sure that you benchmark your entire application and make sure it makes more than 1% improvement - I doubt it. Please do come back and tell us what the difference is. My added if-statement in the "deallocate page-table entries" for the Linux kernel made it 2-5% slower, so wasn't worth it. But if you find that in your code, it's worth it, be my guest. I'm speaking from experience, and I have the numbers to show it, but if you feel like trying for yourself, sure why not.
Upvotes: 1