northwindow

Reputation: 343

Intel MKL multi-threaded matrix-vector multiplication sgemv() slow after short breaks

I need to run a multi-threaded matrix-vector multiplication every 500 microseconds. The matrix stays the same; the vector changes every time.

I use Intel's sgemv() from the MKL on a 64-core AMD CPU. If I run the multiplications back to back in a for-loop in a small test program, each call of sgemv() takes 20 microseconds. If I add a spin loop (polling the TSC) of about 500 microseconds to each iteration of the for-loop, the time per sgemv() call increases to 30 microseconds with OMP_WAIT_POLICY=ACTIVE, and to as much as 60 microseconds with OMP_WAIT_POLICY=PASSIVE (the default).

Does anybody know what could be going on and why it is slower with the breaks? And what can be done to avoid this?

It doesn't seem to make a difference whether the spin loop is single-threaded or runs in a "#pragma omp parallel" context. It also makes no difference whether or not I keep the AVX units busy in the spin loop. The CPU cores are isolated, and the test program runs at a high priority with SCHED_FIFO (this is on Linux).

Spin wait function:

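// Busy-wait until the given number of TSC ticks has elapsed.
// The count is unsigned 64-bit so it matches the type of rdtsc().
static void spin_wait(uint64_t ticks)
{
  uint64_t const start = rdtsc();
  while( rdtsc() - start < ticks )
  {;}
}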
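
The spin wait above counts TSC ticks, while the pause is specified in microseconds. As a minimal sketch of the same busy-wait expressed directly in microseconds, the version below swaps rdtsc() for clock_gettime(CLOCK_MONOTONIC). That swap is an assumption made for portability, not what the question uses, and the names now_ns() and spin_wait_us() are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() under strict C modes */
#endif
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. Assumption: used here as a portable
   stand-in for the rdtsc() helper from the question. */
static uint64_t now_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least `us` microseconds without yielding the CPU. */
static void spin_wait_us(uint64_t us)
{
  uint64_t const start = now_ns();
  while( now_ns() - start < us * 1000ull )
  {;}
}
```

Calling spin_wait_us(500) would then give the roughly 500 microsecond gap described in the question. Each clock_gettime() call costs more than a raw TSC read, so the wake-up granularity is somewhat coarser.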

Timing loop:

uint64_t t0[num], t1[num];
for( int i=0; i<num; i++ )    
{
  // modify input vector, just incrementing each element

  t0[i] = rdtsc();
  cblas_sgemv(...);
  t1[i] = rdtsc();
  spin_wait( 500us );  // pseudocode: the TSC tick count for about 500 microseconds
}
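
Averages alone cannot show whether every call got slower or only a few outliers did. As a rough sketch, the t0/t1 arrays recorded by the loop could be reduced to minimum, median, and maximum per-call durations. The names latency_stats and latency_summary are hypothetical, and the results are in whatever unit the timestamps use (TSC ticks in the question).

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort() on uint64_t values. */
static int cmp_u64(const void *a, const void *b)
{
  uint64_t const x = *(uint64_t const *)a;
  uint64_t const y = *(uint64_t const *)b;
  return (x > y) - (x < y);
}

typedef struct { uint64_t min, median, max; } latency_stats;

/* Computes min/median/max of t1[i] - t0[i] for i in [0, n).
   Returns 0 on success, -1 if n is 0 or the scratch buffer
   cannot be allocated. */
static int latency_summary(uint64_t const *t0, uint64_t const *t1,
                           size_t n, latency_stats *out)
{
  if( n == 0 )
    return -1;
  uint64_t *d = malloc(n * sizeof *d);
  if( !d )
    return -1;
  for( size_t i = 0; i < n; i++ )
    d[i] = t1[i] - t0[i];
  qsort(d, n, sizeof *d, cmp_u64);
  out->min = d[0];
  out->median = d[n / 2];  /* upper median when n is even */
  out->max = d[n - 1];
  free(d);
  return 0;
}
```

A median close to the maximum would suggest a uniform slowdown on every call, whereas a low median with a high maximum would point to occasional disturbances.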

Upvotes: 4

Views: 300

Answers (1)

zp3

Reputation: 5

This might have something to do with context switching, since you are not using a "real" real-time OS. It might also be cache related (or both). Depending on the prediction algorithms and the size of your problem, cache prefetching might simply work better while your code is still "hot" and is repeated thousands of times in a row (although a slowdown in the microsecond range seems quite large for a purely cache-related cause, in my opinion, unless RAM access is also involved). I would also not rule out frequency scaling as the cause, since the processor might run into a power limit that forces it to scale down a bit (AVX2 instructions are usually quite power hungry…).
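
One way to check the frequency-scaling guess on Linux is to sample the core clock reported by the cpufreq sysfs interface before and after a batch of calls. The sketch below is only an illustration under that assumption: read_cpu_khz() is a hypothetical helper, and the file it reads exists only when a cpufreq driver is loaded.

```c
#include <stdio.h>

/* Reads a single integer from a sysfs-style file.
   Returns the value (kHz for cpufreq files), or -1 if the file
   cannot be opened or does not contain a number. */
static long read_cpu_khz(const char *path)
{
  FILE *f = fopen(path, "r");
  if( !f )
    return -1;
  long khz = -1;
  if( fscanf(f, "%ld", &khz) != 1 )
    khz = -1;
  fclose(f);
  return khz;
}
```

For core 0 the path would typically be /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq. Reading it involves system calls, so it should be sampled outside the timed region rather than between t0 and t1.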

Upvotes: -3
