Shaun Holtzman

Reputation: 211

OpenMP Memory Allocation on NUMA Processor

I am currently trying to speed up a simple matrix subtraction benchmark with OpenMP on the Maestro processor, which has a NUMA architecture and is based on the Tilera Tile64 processor. The Maestro board has 49 processors arranged in a 7x7 two-dimensional grid. Each core has its own L1 and L2 cache. A layout of the board can be seen here: https://i.sstatic.net/RG0fC.png

I am new to the idea of writing applications that are 'NUMA-aware', but the main consensus from what I've read is that data locality is a big part of maximizing performance. When parallelizing code among the cores, I should keep the data being used as local as possible to the thread doing the processing.

For this matrix subtraction benchmark (C[i] = A[i] - B[i]), I thought it would be a good idea to allocate each thread its own private A, B, and C arrays with the size being the total work size divided by the number of threads. So for example if the total size of the arrays were 6000*6000 and I was trying to parallelize it across 20 threads, I would allocate private arrays with size (6000*6000)/20. Each thread would do this subtraction on its own private array and then I would gather the results back into a final array of the total size 6000*6000. For example (without the gathering of results from each thread into a final array):

int threads = 20;
int size = 6000;
int j;
uint8_t *C_final = malloc(sizeof(uint8_t)*(size*size));
#pragma omp parallel num_threads(threads) private(j)
{
     /* each thread allocates and works on its own slice of the problem */
     uint8_t *A_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
     uint8_t *B_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
     uint8_t *C_priv = malloc(sizeof(uint8_t)*((size*size)/threads));

     for(j=0; j<((size*size)/threads); j++)
       {
            A_priv[j]=100;
            B_priv[j]=omp_get_thread_num();
            C_priv[j]=0;
       }

     for(j=0; j<((size*size)/threads); j++)
       {
           C_priv[j] = A_priv[j]-B_priv[j];
       }

     /* gather C_priv into C_final here, then free the private arrays */
}
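
For reference, the gathering step omitted above could be done over plain shared memory with something like the following sketch, placed inside the parallel region before the private arrays are freed. It assumes each thread owns a contiguous (size*size)/threads chunk of C_final and that <string.h> is included for memcpy:

     /* sketch only: shared-memory gather; each thread copies its chunk into
        its slice of C_final (assumes (size*size) divides evenly by threads) */
     int chunk = (size*size)/threads;
     memcpy(&C_final[omp_get_thread_num()*chunk], C_priv,
            chunk*sizeof(uint8_t));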

The initial values for the arrays are arbitrary; I just use omp_get_thread_num() so that each thread writes different values into C_priv. I'm currently experimenting with the board's User Dynamic Network, which provides hardware for routing packets between CPUs, to accumulate the individual thread results into a final resulting array.

I have achieved speedup doing it this way, along with pinning the threads with OMP_PROC_BIND=true, but I'm worried that accumulating the individual results into a final array may add enough overhead to negate the speedup.
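
As a side note, if the OpenMP implementation on the Maestro supports OpenMP 4.0, the same pinning idea can also be expressed per region with the proc_bind clause rather than only through the OMP_PROC_BIND environment variable; this is just a sketch and assumes 4.0 support:

     /* sketch only, assumes OpenMP 4.0: spread this region's threads across
        places (e.g. cores) instead of relying solely on OMP_PROC_BIND=true */
     #pragma omp parallel num_threads(threads) proc_bind(spread)
     {
         /* per-thread allocation and subtraction as above */
     }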

Is this a proper way to go about this type of problem? What type of techniques should I look into for getting speedup on a NUMA architecture for a problem like this that uses OpenMP?

Edit:

For clarification, this is what I originally tried and where I noticed a slower execution time than if I just ran the code serially:

     int threads = 20;
     int size = 6000;
     uint8_t *A = malloc(sizeof(uint8_t)*(size*size));
     uint8_t *B = malloc(sizeof(uint8_t)*(size*size));
     uint8_t *C = malloc(sizeof(uint8_t)*(size*size));

     int i;
     /* serial initialization */
     for(i=0; i<(size*size); i++)
     {
       A[i] = 10;
       B[i] = 5;
       C[i] = 0;
     }

     /* parallel subtraction */
     #pragma omp parallel for num_threads(threads)
     for(i=0; i<(size*size); i++)
     {
       C[i] = A[i] - B[i];
     }

After seeing that I was getting a slower execution time when using OpenMP, I tried to figure out why. Based on what I have read about NUMA architectures, it seemed as though data locality was the issue.

I am having a hard time trying to figure out how to alleviate the bottlenecks that are slowing it down. I found some help in similar questions like this: OpenMP: for schedule, which talks about allocating data to each thread so that each thread works on its local data.
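
If I understand that answer correctly, the idea (assuming the OS uses a first-touch page-placement policy, which I have not confirmed for the Maestro) would be to initialize the arrays with the same parallel schedule as the subtraction, so that each thread first touches the pages it will later work on. A rough sketch of what that would look like for this benchmark:

     /* sketch of the first-touch idea: initialize in parallel with the same
        static schedule as the compute loop, so each thread's pages are first
        touched (and, under a first-touch policy, placed) near that thread */
     #pragma omp parallel for num_threads(threads) schedule(static)
     for(i=0; i<(size*size); i++)
     {
       A[i] = 10;
       B[i] = 5;
       C[i] = 0;
     }

     #pragma omp parallel for num_threads(threads) schedule(static)
     for(i=0; i<(size*size); i++)
     {
       C[i] = A[i] - B[i];
     }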

I just feel like it should not be this difficult to get increased performance out of something as simple as a matrix subtraction when using OpenMP. I'm not sure how to go about figuring out what exactly the bottleneck is or how to alleviate it.

Upvotes: 5

Views: 1002

Answers (1)

Aaron Altman

Reputation: 1755

From a quick search and scan of the TILE64 datasheet, it doesn't look like the architecture exposes performance counters like the ones you'd use on x86 via tools such as oprofile, VTune, or xperf. Without those, you'll have to devise some experiments of your own to iteratively narrow down which portion of the code is hot and why; in the absence of microarchitectural docs and tools that indicate how your code exercises the hardware, it's a bit of a reverse-engineering task.

Some ideas about where to start on that:

  1. Do some scaling experiments. Is there a knee in the curve where going over a certain problem size or number of threads has a big effect on the overall performance? Does that number hint at some clear relationship with the size of a certain level in the memory hierarchy, or a dimension of the grid of processors, or similar?
  2. Record execution times at a few points through the program. It would probably be useful to know, for example, at a high level how much time is spent on the mallocs vs. the first loop vs. the second (see the timing sketch after this list).
  3. "I have achieved speedup doing it this way along with pinning the threads with OMP_PROC_BIND=true but I'm worried that accumulating the individual results into a final array may cause overhead that would negate the speedup." - this worry is also empirically testable, especially if you're working on a large enough problem size that your timer accuracy as in (2) is not an issue for isolating time taken for the gather step vs. the part that's completely parallelizable.
  4. Try a different operation - say, addition or element-wise division instead of subtraction - and see if that changes the results. On many architectures different arithmetic operations have different latency and throughput. If you looked that up and found it to be the case for the TILE64, making such a change and instrumenting the runtime of your second example might tell you something useful: in particular, how much of the extra time over the serial run is really due to data locality issues, and how much is startup or other overhead in the OpenMP runtime - overhead that may matter in the overall results mainly because the problem size is small, rather than because the properly parallel part of the implementation is actually running slower.
  5. You could examine generated assembly. The assumption that the compiler would do basically the same things in the examples you've posted seems reasonable, but doesn't necessarily hold as strongly as you would want it to when looking at odd performance. Maybe there's something about the code size or layout that changes with/without OpenMP or when moving from one parallel approach to another, like use of instruction cache, availability of reservation station or ROB entries (if the TILE64 has those things)...? Who knows, until you look.
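
For (2) and (3), here's a rough sketch of the kind of coarse phase timing I mean, using omp_get_wtime() around each phase of your second example (names and constants mirror that example; adjust as needed):

     /* sketch: coarse per-phase timing with omp_get_wtime(); mirrors the
        second example from the question */
     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <omp.h>

     int main(void)
     {
         int threads = 20;
         int size = 6000;
         int i;

         double t0 = omp_get_wtime();
         uint8_t *A = malloc(sizeof(uint8_t)*(size*size));
         uint8_t *B = malloc(sizeof(uint8_t)*(size*size));
         uint8_t *C = malloc(sizeof(uint8_t)*(size*size));
         double t_alloc = omp_get_wtime();

         for(i=0; i<(size*size); i++)      /* first loop: initialization */
         {
             A[i] = 10;
             B[i] = 5;
             C[i] = 0;
         }
         double t_init = omp_get_wtime();

         #pragma omp parallel for num_threads(threads)
         for(i=0; i<(size*size); i++)      /* second loop: subtraction */
         {
             C[i] = A[i] - B[i];
         }
         double t_sub = omp_get_wtime();

         printf("malloc:   %f s\n", t_alloc - t0);
         printf("init:     %f s\n", t_init - t_alloc);
         printf("subtract: %f s\n", t_sub - t_init);

         free(A); free(B); free(C);
         return 0;
     }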

Upvotes: 1
