Reputation: 2227
I am trying to write my first multi-threaded program in C under Linux. I already have a program playing with setting and resetting bits in a big buffer; now I just want to make it faster - as fast as possible without writing everything in assembler.
For the single-threaded program, I defined my own macros for the bit manipulation (they kind of look big and ugly, but they work):
#define CELL_SIZE (64)
#define getBit(bitIdx, arr) ( arr[ (bitIdx) / CELL_SIZE ] & ( (1uL << (CELL_SIZE - 1)) >> (bitIdx) % CELL_SIZE))
#define resetBit(bitIdx, arr) ( arr[ (bitIdx) / CELL_SIZE ] &= ~( (1uL << (CELL_SIZE - 1)) >> (bitIdx) % CELL_SIZE))
Obviously, those operations are anything BUT atomic.
I did some research and I found several answers. Some answers are about how to write my own functions in assembler (which I want to avoid).
Other sites (this or this) seem to talk exactly about what I need - except that I have no idea how to use those interfaces / macros.
I tried the following includes, but they do not work.
#include <linux/bitmap.h>
#include <asm/bitops.h>
So, basically, I ask how can I use what is already implemented in order to get my work done? Which header(s) should I include?
I do not mind if the interfaces are not in the kernel, as long as the job gets done.
Note: each thread would (re)set exactly one bit at one time - I do not plan any VOODOO complicated stuff.
A side question would be: which one would be faster (overall, written in pseudo-code):
reset_bit();
or
if ( bit_is_set() )
reset_bit();
Quite often the reset_bit() operation is not necessary, as the bit is already reset. The same question applies to setting an already-set bit.
Upvotes: 3
Views: 1245
Reputation: 58493
To answer the question directly, the most straightforward and readily available solution is C11 <stdatomic.h>, which any modern C compiler for a multiprocessor system should provide.
Available operations include atomic OR/AND which you can use to set and clear bits. So you might do
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
typedef uint64_t cell_t;
#define CELL_SIZE (64)
static inline _Atomic cell_t *get_cell(size_t bitIdx, _Atomic cell_t *arr) {
    return arr + (bitIdx / CELL_SIZE);
}
static inline cell_t get_mask(size_t bitIdx) {
    return (1uL << (CELL_SIZE - 1)) >> (bitIdx % CELL_SIZE);
}
static inline cell_t getBit(size_t bitIdx, _Atomic cell_t *arr) {
    return atomic_load(get_cell(bitIdx, arr)) & get_mask(bitIdx);
}
static inline void resetBit(size_t bitIdx, _Atomic cell_t *arr) {
    atomic_fetch_and(get_cell(bitIdx, arr), ~get_mask(bitIdx));
}
static inline void setBit(size_t bitIdx, _Atomic cell_t *arr) {
    atomic_fetch_or(get_cell(bitIdx, arr), get_mask(bitIdx));
}
I took the liberty of replacing your macros with inline functions, which are preferable in practically all situations.
You might be able to speed things up somewhat, on some platforms, by using the *_explicit versions with memory_order_relaxed, but it depends greatly on how the functions will be used, and an introduction to memory ordering is outside the scope of this post.
The performance implications are harder. Atomic read-modify-write operations are a lot slower than non-atomic, usually. And if there is contention between CPUs for cache lines, that slows things down further. Unless you can control these aspects really well, your plan of putting more cores to work on the task will probably make it slower.
As to your last question about whether to test the bit first: it is hard to say. A load is generally faster than a read-modify-write. However, the cost of doing the test and conditional branch, and especially the possibility of mispredicting that branch, may cancel out the savings.
Upvotes: 5