performancegcccompiler-optimizationcpu-cacheprefetch

Reputation: 889

Speed up random memory access using prefetch

I am trying to speed up a single program by using prefetches. The purpose of my program is just for test. Here is what it does:

It uses two int buffers of the same size

It reads one-by-one all the values of the first buffer

It reads the value at the index in the second buffer

It sums all the values taken from the second buffer

It does all the previous steps for bigger and bigger

At the end, I print the number of voluntary and involuntary CPU

In the very first time, values in the first buffers contains the values of its index (cf. function createIndexBuffer in the code just below) .

It will be more clear in the code of my program:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>

#define BUFFER_SIZE ((unsigned long) 4096 * 100000)


unsigned int randomUint()
{
  int value = rand() % UINT_MAX;
  return value;
}


unsigned int * createValueBuffer()
{
  unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    valueBuffer[i] = randomUint();
  }

  return (valueBuffer);
}


unsigned int * createIndexBuffer()
{
  unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    indexBuffer[i] = i;
  }

  return (indexBuffer);
}


unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  unsigned long long sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    unsigned int index = indexBuffer[i];
    sum += valueBuffer[index];
  }

  return (sum);
}


unsigned int computeTimeInMicroSeconds()
{
  unsigned int * valueBuffer = createValueBuffer();
  unsigned int * indexBuffer = createIndexBuffer();

  struct timeval startTime, endTime;
  gettimeofday(&startTime, NULL);

  unsigned long long sum = computeSum(indexBuffer, valueBuffer);

  gettimeofday(&endTime, NULL);

  printf("Sum = %llu\n", sum);
  free(indexBuffer);
  free(valueBuffer);

  return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);

}


int main()
{
  printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
  unsigned int timeInMicroSeconds = computeTimeInMicroSeconds();
  printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}

If I launch it, I get the following output:

$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch 
sizeof buffers = 1562Mb
Sum = 439813150288855829
Time: 201172 micro-seconds = 0.201 seconds

Quick and fast!!! According to my knowledge (I may be wrong), one of the reason for having such a fast program is that, as I access my two buffers sequentially, data can be prefetched in the CPU cache.

We can make it more complex in order that data is (almost) prefeched in CPU cache. For example, we can just change the createIndexBuffer function in:

unsigned int * createIndexBuffer()
{
  unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    indexBuffer[i] = rand() % BUFFER_SIZE;
  }

  return (indexBuffer);
}

Let's try the program once again:

$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch 
sizeof buffers = 1562Mb
Sum = 439835307963131237
Time: 3730387 micro-seconds = 3.730 seconds

More than 18 times slower!!!

We now arrive to my problem. Given the new createIndexBuffer function, I would like to speed up computeSum function using prefetch

unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  unsigned long long sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    __builtin_prefetch((char *) &indexBuffer[i + 1], 0, 0);
    unsigned int index = indexBuffer[i];
    sum += valueBuffer[index];
  }

  return (sum);
}

of course I also have to change my createIndexBuffer in order it allocates a buffer having one more element

I relaunch my program: not better! As prefetch may be slower than one "for" loop iteration, I may prefetch not one element before but two elements before

    __builtin_prefetch((char *) &indexBuffer[i + 2], 0, 0);

not better! two loops iterations? not better? Three? **I tried it until 50 (!!!) but I cannot enhance the performance of my function computeSum.

Can I would like help to understand why Thank you very much for your help

Upvotes: 1

Answers (4)

FPK

Reputation: 2098

Prefetch fetches normally a full cache line. This is typically 64 bytes. So the random example fetches always 64 bytes for a 4 byte int. 16 times the data you actually need which fits very well with the slow down by a factor of 18. So the code is simply limited by memory throughput and not latency.

Upvotes: 1

Philippe MESMEUR

Reputation: 889

Then I adapted my program to try your suggestion using the sin function.

My adapted program is the following one:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>

#define BUFFER_SIZE ((unsigned long) 4096 * 50000)


unsigned int randomUint()
{
  int value = rand() % UINT_MAX;
  return value;
}


unsigned int * createValueBuffer()
{
  unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    valueBuffer[i] = randomUint();
  }

  return (valueBuffer);
}


unsigned int * createIndexBuffer(unsigned short prefetchStep)
{
  unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + prefetchStep) * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    indexBuffer[i] = rand() % BUFFER_SIZE;
  }

  return (indexBuffer);
}


double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
  double sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    __builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
    unsigned int index = indexBuffer[i];
    sum += sin(valueBuffer[index]);
  }

  return (sum);
}


unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
  unsigned int * valueBuffer = createValueBuffer();
  unsigned int * indexBuffer = createIndexBuffer(prefetchStep);

  struct timeval startTime, endTime;
  gettimeofday(&startTime, NULL);

  double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);

  gettimeofday(&endTime, NULL);

  printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
  free(indexBuffer);
  free(valueBuffer);

  return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);

}


int main()
{
  printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
  for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
  {
    unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
    printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
  }
}

The output is:

$ gcc TestPrefetch.c -O3 -o TestPrefetch -lm && taskset -c 7 ./TestPrefetch 
sizeof buffers = 781Mb
prefetchStep = 0, Sum = -1107.523504 - Time: 20895326 micro-seconds = 20.895 seconds
prefetchStep = 1, Sum = 13456.262424 - Time: 12706720 micro-seconds = 12.707 seconds
prefetchStep = 2, Sum = -20179.289469 - Time: 12136174 micro-seconds = 12.136 seconds
prefetchStep = 3, Sum = 12068.302534 - Time: 11233803 micro-seconds = 11.234 seconds
prefetchStep = 4, Sum = 21071.238160 - Time: 10855348 micro-seconds = 10.855 seconds
prefetchStep = 5, Sum = -22648.280105 - Time: 10517861 micro-seconds = 10.518 seconds
prefetchStep = 6, Sum = 22665.381676 - Time: 9205809 micro-seconds = 9.206 seconds
prefetchStep = 7, Sum = 2461.741268 - Time: 11391088 micro-seconds = 11.391 seconds
...

So here, it works better! Honestly, I was almost sure that it will not be better because the math function cost is higher compared to the memory access.

If anyone could give me more information about why it is better now, I would appreciate it

Thank you very much

Upvotes: 0

Philippe MESMEUR

Reputation: 889

Sorry. What I gave you was not the correct version of my code. The correct version is, what you said:

__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);

However, even with the right version, it is unfortunately not better

Upvotes: 0

gudok

Reputation: 4179

I believe that above code is automatically optimized by CPU without any further space for manual optimization.

1. Main problem is that indexBuffer is sequentially accessed. Hardware prefetcher senses it and prefetches further values automatically, without need to call prefetch manually. So, during iteration #i, values indexBuffer[i+1], indexBuffer[i+2],... are already in cache. (By the way, there is no need to add artificial element to the end of array: memory access errors are silently ignored by prefetch instructions).

What you really need to do is to prefetch valueBuffer instead:

__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);

2. But adding above line of code won't help either in such simple scenario. Cost of accessing memory is hundreds of cycles, while add instruction is ~1 cycle. Your code already spends 99% of time in memory accesses. Adding manual prefetch will make it this one cycle faster and no better.

Manual prefetch would really work well if your math were much more heavy (try it), like using an expression with large number of non-optimized out divisions (20-30 cycles each) or calling some math function (log, sin).

3. But even this doesn't guarantee to help. Dependency between loop iterations is very weak, it is only via sum variable. This allows CPU to execute instructions speculatively: it may start fetching valueBuffer[i+1] concurrently while still executing math for valueBuffer[i].

Upvotes: 4

Speed up random memory access using prefetch

Answers (4)

Related Questions