Reputation: 391
I am comparing the performance of many processes, each trying to acquire a semaphore that is always under contention, on an ARMv8 server running Ubuntu Linux with kernel 4.15.0-112.
First I used a named POSIX semaphore, and then a System V semaphore whose set contained a single semaphore.
The performance was 8% worse with the System V semaphore. I know System V performs worse when there is no contention on the semaphore, but I don't understand the performance hit when there is contention. When running the same test on Intel I don't see any difference in performance, but I believe something else is the bottleneck there.
I want to use a System V semaphore for two reasons: the kernel can release the semaphore if one of the processes crashes (via SEM_UNDO), and a single operation can increment or decrement the count by any value instead of only ±1.
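For example, a single semop call can move the count by several units atomically; a minimal sketch (the value 3 is just an illustration, using the semId set up in the code below):

    // Sketch: acquire 3 units in one atomic semop() call (illustrative value).
    struct sembuf op;
    op.sem_num = 0;        // first (and only) semaphore in the set
    op.sem_op  = -3;       // decrement by 3; blocks until the count allows it
    op.sem_flg = SEM_UNDO; // kernel reverses the op if the process dies
    semop(semId, &op, 1);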
This is the code I wrote for the System V semaphore (initialize is called from only one process, and I get the same performance hit without the SEM_UNDO flag in wait and post):
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <cstdio>
#include <cstdlib>

void init_semaphore(bool initialize, int initCount, int& semId)
{
    const char* file_path = "/tmp/semv"; // must exist before ftok()
    char proj_id = 'p';
    key_t key = ftok(file_path, proj_id);
    if (key == -1) {
        perror("ftok");
        exit(EXIT_FAILURE);
    }
    if (initialize)
    {
        int flags = IPC_CREAT | 0666;
        semId = semget(key, 1, flags); // create the set with 1 semaphore
        if (semId == -1) {
            perror("semget");
            exit(EXIT_FAILURE);
        }
        int i = semctl(semId, 0, SETVAL, initCount); // set the initial count
        if (i == -1) {
            perror("semctl");
            exit(EXIT_FAILURE);
        }
        i = semctl(semId, 0, GETVAL);
        printf("current value of %d is %d\n", semId, i);
    }
    else
    {
        semId = semget(key, 1, 0); // attach to the existing set
        if (semId == -1) {
            perror("semget");
            exit(EXIT_FAILURE);
        }
    }
}
void release_semaphore(int semId)
{
    // IPC_RMID removes the semaphore set from the system.
    int i = semctl(semId, 0, IPC_RMID);
    if (i == -1) {
        perror("semctl");
        exit(EXIT_FAILURE);
    }
}
void post(int semId)
{
    struct sembuf sops_post[1];
    sops_post[0].sem_num = 0;
    sops_post[0].sem_op = 1;          // increment: release one unit
    sops_post[0].sem_flg = SEM_UNDO;  // kernel undoes this if the process dies
    if (semop(semId, sops_post, 1) == -1)
        perror("semop");
}
void wait(int semId)
{
    struct sembuf sops_wait[1];
    sops_wait[0].sem_num = 0;
    sops_wait[0].sem_op = -1;         // decrement: blocks while the count is 0
    sops_wait[0].sem_flg = SEM_UNDO;
    if (semop(semId, sops_wait, 1) == -1)
        perror("semop");
}
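For reference, the named POSIX semaphore side of the comparison would look roughly like this; a minimal sketch, where the name "/mysem" is an assumption rather than the name I actually used:

    #include <semaphore.h>
    #include <fcntl.h>
    #include <cstdio>
    #include <cstdlib>

    // Sketch of the POSIX named-semaphore equivalent ("/mysem" is assumed).
    sem_t* init_posix_semaphore(bool initialize, unsigned initCount)
    {
        // O_CREAT is harmless if the semaphore already exists.
        sem_t* sem = sem_open("/mysem", initialize ? O_CREAT : 0, 0666, initCount);
        if (sem == SEM_FAILED) {
            perror("sem_open");
            exit(EXIT_FAILURE);
        }
        return sem;
    }

    // sem_wait/sem_post move the count by exactly 1 and have no undo-on-crash.
    void posix_wait(sem_t* sem) { sem_wait(sem); }
    void posix_post(sem_t* sem) { sem_post(sem); }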
Upvotes: 2
Views: 762
Reputation: 22395
Paraphrase: Why is System V semaphore contention worse on ARM than on Intel?
TL;DR - the System V implementation uses Linux RCU. RCU is a lock-free algorithm and relies on the processor's memory model. For Intel CPUs, the TSO memory model is much more forgiving than ARM's memory model.
On ARM Linux, semop is implemented as a syscall, and a syscall has significant overhead. The POSIX semaphore, by contrast, is implemented via a call in the vector/kuser table with __kuser_cmpxchg. This is like the concept of a vDSO; in fact, it is a vDSO on ARM64, while on ARM32 it is mapped through the vector table.
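Conceptually, that user-space fast path looks something like this sketch, using C++11 atomics as a stand-in for __kuser_cmpxchg (the real sem_wait falls back to a futex syscall when it must sleep):

    #include <atomic>

    // Sketch of a user-space semaphore fast path: no syscall unless we must block.
    bool try_wait(std::atomic<int>& count)
    {
        int v = count.load(std::memory_order_relaxed);
        while (v > 0) {
            // compare_exchange maps to ldrex/strex (or __kuser_cmpxchg) on ARM.
            if (count.compare_exchange_weak(v, v - 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
                return true;   // acquired entirely in user space
            // v was reloaded by compare_exchange_weak; loop and retry
        }
        return false;          // would block: real code does a futex wait here
    }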
The code for semop is found in sem.c.
Just look at the complexity of that code! One path involves a syscall, which is a mode switch requiring complete banking of user state and possibly a context switch. Under contention, the POSIX semaphore will spin in this loop:
.inst 0xe1923f9f // 1: ldrex r3, [r2]
.inst 0xe0533000 // subs r3, r3, r0
.inst 0x01823e91 // stlexeq r3, r1, [r2]
.inst 0x03330001 // teqeq r3, #1
.inst 0x0afffffa // beq 1b
It is followed by a barrier to ensure observability by other cores. The whole mechanics of semop are layered on the Linux VFS and RCU. It makes sense that people may want to use System V for system-global semaphores. However, for efficiency, POSIX semaphores are far better for single-process, multi-threaded logic.
The mechanics are night and day when you examine them. POSIX stays in user mode, executing on the order of 100 instructions in a tight loop (like a spin lock) in user space. System V makes a syscall, a mode change costing thousands of instructions.
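A rough way to see the gap is to time each primitive in a loop. This is only a sketch: the iteration count is arbitrary, and a single process measures the uncontended round trip, so several processes must hammer the same semaphore to reproduce the contended case.

    #include <ctime>
    #include <cstdio>

    // Rough timing harness: reports nanoseconds per post/wait pair.
    // 'post' and 'wait' are the System V wrappers from the question;
    // swap in sem_post/sem_wait to measure the POSIX side.
    void time_semaphore(int semId)
    {
        const long ITERS = 1000000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; ++i) {
            post(semId);
            wait(semId);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        printf("%ld ns per post/wait pair\n", ns / ITERS);
    }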
See my answer (near the end) to my own question for more on POSIX semaphores. If you plot a histogram of the timings, I would expect System V to show a far higher worst case than POSIX. 8% worse might be a typical/average difference. The source in sem.c describes a worst case of O(n^2), where I assume 'n' is the number of contending lockers. For POSIX, some thread will always hold the lock, and the first successful strex gets the update; it is O(1).
The performance was 8% worse with the System V semaphore. I know System V performs worse when there is no contention on the semaphore, but I don't understand the performance hit when there is contention. When running the same test on Intel I don't see any difference in performance, but I believe something else is the bottleneck there.
Intel's memory model is TSO, so the RCU code is probably more efficient in the contended case. ARM's memory model is more like POWER's, and it is going to need many smp_wmb() and smp_rmb() barriers that are probably not needed on Intel. These barriers slow down all cores, as they are global and need to synchronize the pipelines of every core.
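You can see the cost difference by compiling a release store for both targets; a minimal sketch (exact codegen depends on the compiler):

    #include <atomic>

    std::atomic<int> flag{0};

    // On x86-64 (TSO) this compiles to a plain MOV: ordinary stores are
    // already release stores. On ARMv8 it needs an stlr (or dmb + str),
    // which is exactly the extra synchronization described above.
    void publish()
    {
        flag.store(1, std::memory_order_release);
    }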
As Wikipedia's comparison of memory models shows, other CPUs, such as the Alpha, will show even more performance degradation.
Upvotes: 2