Reputation: 211
I am currently trying to speed up a simple matrix subtraction benchmark with OpenMP on the Maestro processor, which has a NUMA architecture and is based on the Tilera Tile64 processor. The Maestro board has 49 processors arranged in a 7x7 two-dimensional grid. Each core has its own L1 and L2 cache. A layout of the board can be seen here: https://i.sstatic.net/RG0fC.png
I am new to the idea of writing 'NUMA-aware' applications, but the main consensus from what I've read is that data locality is a big part of maximizing performance. When parallelizing code across the cores, I should keep the data being used as local as possible to the thread doing the processing.
For this matrix subtraction benchmark (C[i] = A[i] - B[i]), I thought it would be a good idea to allocate each thread its own private A, B, and C arrays with the size being the total work size divided by the number of threads. So for example if the total size of the arrays were 6000*6000 and I was trying to parallelize it across 20 threads, I would allocate private arrays with size (6000*6000)/20. Each thread would do this subtraction on its own private array and then I would gather the results back into a final array of the total size 6000*6000. For example (without the gathering of results from each thread into a final array):
int threads = 20;
int size = 6000;

/* needs: #include <stdint.h>, <stdlib.h>, <omp.h> */
uint8_t *C_final = malloc(sizeof(uint8_t)*(size*size)); /* final gathered result (gather step omitted here) */

#pragma omp parallel num_threads(threads)
{
    int j;
    /* each thread allocates its own private chunk of the work */
    uint8_t *A_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
    uint8_t *B_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
    uint8_t *C_priv = malloc(sizeof(uint8_t)*((size*size)/threads));

    for(j=0; j<((size*size)/threads); j++)
    {
        A_priv[j] = 100;
        B_priv[j] = omp_get_thread_num();
        C_priv[j] = 0;
    }

    for(j=0; j<((size*size)/threads); j++)
    {
        C_priv[j] = A_priv[j] - B_priv[j];
    }
}
The initial values for the arrays are arbitrary; I just use omp_get_thread_num() so that each thread produces different values in C_priv. I'm currently experimenting with the board's User Dynamic Network, which provides hardware for routing packets between CPUs, to accumulate all of the individual thread results into a final resulting array.
I have achieved speedup doing it this way, along with pinning the threads with OMP_PROC_BIND=true, but I'm worried that accumulating the individual results into a final array may introduce enough overhead to negate the speedup.
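For reference, before trying the User Dynamic Network, a plain shared-memory gather would look roughly like the sketch below: each thread just copies its private chunk into C_final at its own offset. This is only a sketch; it assumes size*size divides evenly by the thread count, that C_priv has been filled as in the code above, and that <string.h> is included for memcpy.

int chunk = (size*size)/threads;
#pragma omp parallel num_threads(threads)
{
    int tid = omp_get_thread_num();
    uint8_t *C_priv = malloc(sizeof(uint8_t)*chunk);
    /* ... fill C_priv with this thread's subtraction results, as above ... */

    /* thread tid owns elements [tid*chunk, (tid+1)*chunk) of C_final,
       so no synchronization is needed for the copy */
    memcpy(&C_final[tid * chunk], C_priv, chunk * sizeof(uint8_t));

    free(C_priv);
}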
Is this a proper way to go about this type of problem? What type of techniques should I look into for getting speedup on a NUMA architecture for a problem like this that uses OpenMP?
Edit:
For clarification, this is what I originally tried and where I noticed a slower execution time than if I just ran the code serially:
int threads = 20;
int size = 6000;

/* needs: #include <stdint.h>, <stdlib.h> */
uint8_t *A = malloc(sizeof(uint8_t)*(size*size));
uint8_t *B = malloc(sizeof(uint8_t)*(size*size));
uint8_t *C = malloc(sizeof(uint8_t)*(size*size));

int i;
/* serial initialization of all three arrays */
for(i=0; i<(size*size); i++)
{
    A[i] = 10;
    B[i] = 5;
    C[i] = 0;
}

/* parallel element-wise subtraction */
#pragma omp parallel for num_threads(threads)
for(i=0; i<(size*size); i++)
{
    C[i] = A[i] - B[i];
}
After seeing that I was getting slower execution times when using OpenMP, I tried to figure out why. It seems as though data locality is the issue; this assumption is based on what I have read about NUMA architectures.
I am having a hard time figuring out how to alleviate the bottlenecks that are slowing it down. I found some help in similar questions like this one: OpenMP: for schedule, where it talks about allocating data to each thread so that each thread works on its local data.
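If I understand that suggestion correctly, it would mean initializing the arrays inside a parallel loop with the same schedule(static) as the compute loop, so that first-touch page placement leaves each thread's slice in memory close to the core that will later process it. I'm not certain the Maestro's OS handles first-touch placement the same way a typical Linux/x86 NUMA system does, but a sketch of that approach would be:

/* needs: #include <stdint.h>, <stdlib.h> */
int threads = 20;
int size = 6000;
uint8_t *A = malloc(sizeof(uint8_t)*(size*size));
uint8_t *B = malloc(sizeof(uint8_t)*(size*size));
uint8_t *C = malloc(sizeof(uint8_t)*(size*size));
int i;

/* parallel first-touch initialization: each thread touches (and therefore
   places) the pages it will later compute on */
#pragma omp parallel for schedule(static) num_threads(threads)
for(i=0; i<(size*size); i++)
{
    A[i] = 10;
    B[i] = 5;
    C[i] = 0;
}

/* same static distribution, so each thread reads/writes its own pages */
#pragma omp parallel for schedule(static) num_threads(threads)
for(i=0; i<(size*size); i++)
{
    C[i] = A[i] - B[i];
}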
I just feel like it should not be difficult to get increased performance with OpenMP for something as simple as a matrix subtraction. I'm not sure how to figure out what exactly the bottleneck is or how to alleviate it.
Upvotes: 5
Views: 1002
Reputation: 1755
From a quick search and scan of the TILE64 datasheet, it doesn't look like the architecture exposes performance counters of the kind you'd use on x86 via tools like oprofile, VTune, or xperf. Without those, you'll have to devise some experiments of your own to iteratively narrow down which portion of the code is hot and why; in the absence of microarchitectural docs and tools that show how your code is exercising the hardware, it's a bit of a reverse-engineering task.
Some ideas about where to start on that:
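For example (just a sketch of the kind of experiment meant above, not something tested on TILE64), wrapping each phase of your benchmark in omp_get_wtime() calls and rerunning with different thread counts will tell you whether the initialization, the subtraction loop, or the gather is where the time goes:

/* needs: #include <stdio.h>, <omp.h> */
double t0 = omp_get_wtime();
/* ... initialization loop ... */
double t1 = omp_get_wtime();
/* ... parallel subtraction loop ... */
double t2 = omp_get_wtime();
/* ... gather / accumulation into the final array ... */
double t3 = omp_get_wtime();

printf("init %.3f s, subtract %.3f s, gather %.3f s\n",
       t1 - t0, t2 - t1, t3 - t2);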
Upvotes: 1