momo123

Reputation: 51

OpenMP simd (increment vector)

I have tried to apply #pragma omp simd to the following loops, but it does not seem to work (no speed improvement). I also tried #pragma omp simd linear, but all my attempts resulted in a segmentation fault.

https://github.com/Rdatatable/data.table/blob/master/src/fsort.c#L209 https://github.com/Rdatatable/data.table/blob/master/src/fsort.c#L184

Is it even possible to increment a vector with SIMD? Example:

#include <stdio.h>
#include <stdlib.h>

int main() {
  int len = 1000;
  int tmp[len];
  /* Fill tmp with random indices in [0, 100). */
  for(int i=0; i<len; ++i) {
    tmp[i]=rand()%100;
  }
  int *thisCounts = (int *) calloc(len, sizeof(int));
  if (thisCounts == NULL) return 1;
  /* Histogram: this is the kind of loop I would like to vectorize. */
  for (int j=0; j<len; ++j) {
    thisCounts[tmp[j]]++;
  }
  for (int j=0; j<len; ++j) {
    printf("%d, ",thisCounts[j]);
  }
  free(thisCounts);
  return 0;
}

FYI, line 209 is the one that takes the most time and the one I am trying to improve.

Thank you

Upvotes: 0

Views: 313

Answers (1)

Jérôme Richard

Reputation: 50816

It depends on the target hardware architecture. Many processor architectures do not have SIMD instructions that perform this kind of indirect access. Mainstream x86-64 processors have gather instructions (AVX2) and, with AVX-512, scatter instructions that can perform such a computation. However, they are not efficiently implemented and are thus not significantly faster than non-SIMD instructions. Moreover, using them is difficult here since there may be increment conflicts (if tmp[j1] == tmp[j2] with j1 != j2). The AVX-512 SIMD instruction set contains interesting instructions for that (conflict detection, AVX-512CD), but it is only available on a few recent processors. The same applies to ARM with SVE/SVE2, which is very new and not yet available on the vast majority of ARM processors.
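To make the conflict problem concrete, here is a minimal sketch in plain C that emulates what a naive gather / add / scatter sequence would do. No real SIMD intrinsics are used; the LANES width and the function names count_scalar and count_naive_lanes are assumptions made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* hypothetical vector width, chosen for illustration */

/* Reference: scalar histogram, one increment at a time. */
void count_scalar(const int *idx, size_t n, int *counts)
{
    for (size_t j = 0; j < n; ++j)
        counts[idx[j]]++;
}

/* Emulates a naive gather / add / scatter: all LANES loads happen
 * before any store, so duplicate indices inside one block read the
 * same old value and all but one of their increments are lost. */
void count_naive_lanes(const int *idx, size_t n, int *counts)
{
    assert(n % LANES == 0);  /* keep the sketch simple: no tail loop */
    for (size_t j = 0; j < n; j += LANES) {
        int v[LANES];
        for (int l = 0; l < LANES; ++l)   /* "gather" */
            v[l] = counts[idx[j + l]];
        for (int l = 0; l < LANES; ++l)   /* "add" */
            v[l] += 1;
        for (int l = 0; l < LANES; ++l)   /* "scatter" */
            counts[idx[j + l]] = v[l];
    }
}
```

With the indices {3, 3, 1, 3, 0, 2, 2, 5}, count_scalar gives counts[3] == 3, while count_naive_lanes gives counts[3] == 1 because the three increments in the first block collide.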

In short, there is only a slight chance that your processor can do this using SIMD instructions, but that does not mean it is impossible on every architecture. Note also that using #pragma omp simd is likely not correct here because of the possible conflicts: the directive asserts that the iterations can safely be executed concurrently in SIMD lanes, which does not hold when two iterations increment the same element. Finally, the speed of this operation likely depends on the input data on many modern processors (random data do not behave like most real-world inputs).
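As a side note on the data dependence, one scalar (non-SIMD) mitigation that is sometimes used when the same index appears in consecutive iterations is to alternate between several private count tables and merge them at the end. This is a different technique from what the question asks, and whether it helps depends on the processor and the data; it has not been measured here. The function name count_split and the choice of two tables are assumptions for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Adds the number of occurrences of each idx[j] (assumed to lie in
 * [0, bins)) to counts[0..bins). Returns 0 on success, -1 if an
 * allocation fails. */
int count_split(const int *idx, size_t n, int *counts, size_t bins)
{
    assert(bins > 0);
    /* Two private tables: a hypothetical choice, to be tuned. */
    int *c0 = calloc(bins, sizeof *c0);
    int *c1 = calloc(bins, sizeof *c1);
    if (!c0 || !c1) { free(c0); free(c1); return -1; }

    size_t j = 0;
    for (; j + 1 < n; j += 2) {
        c0[idx[j]]++;       /* even positions */
        c1[idx[j + 1]]++;   /* odd positions, independent of c0 */
    }
    if (j < n)
        c0[idx[j]]++;       /* leftover element when n is odd */

    /* Merge the private tables into the caller's counts. */
    for (size_t b = 0; b < bins; ++b)
        counts[b] += c0[b] + c1[b];

    free(c0);
    free(c1);
    return 0;
}
```

Two adjacent increments of the same bin then go to different tables, so the second one does not have to wait for the first store to complete. The result is identical to the plain loop.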

Upvotes: 2
