Jorge Leitao

Reputation: 20163

N-body problem: Efficient parallelization of the double for loop

A very common pattern in an N-body problem is the use of a double cycle to calculate the interactions between the particles. Considering an N-body problem with n particles, the cycle can be written as

for (i = 0; i < n; i++)
    for (j = i+1; j < n; j++)
        // calculate interaction

My question is about how this cycle can be parallelized using different threads. The objective is that each thread would "ideally" calculate the same number of interactions.

My idea was to split the outer cycle, the i-cycle, into different intervals [a(k), a(k+1)), where k = 1, 2, ..., p and p is the number of threads we want to divide the problem into.

So, the cycle could be written as

for (k = 1; k <= p; k++)
    for (i = a(k); i < a(k+1); i++)
        for (j = i+1; j < n; j++)
            // calculate interaction

Where the most outer cycle, the k-cycle, is the one to be parallelized.

Because the number of iterations of the innermost cycle, the j-cycle, is n - (i+1), the number of interactions calculated by thread k is

\sum_{i=a(k)}^{a(k+1)-1} (n - (i+1))

This means that one would like to find the discrete function a(k) such that the functional

f[a_k] = \sum_{i=a(k)}^{a(k+1)-1} (n - (i+1))

with the boundary conditions a(1) = 0 and a(p+1) = n, is constant in k, thus forcing the number of interactions on each thread to be the same.

I've tried different heuristics (e.g. a_k polynomial, exponential, logarithmic), and so far none has given me a satisfactory answer. A direct solution of this problem is not evident to me.

For small p, this can be cast as a minimization problem where each a_k is a variable and one minimizes

f(a_1, a_2, a_3, ...) = \sum_k |f[a_k] - T/p|^2,   where T = n(n-1)/2 is the total number of interactions

But as you might guess, this is not efficient (and may not even converge) for higher values of p.

Does anyone have any idea of how could this problem be tackled?

Upvotes: 8

Views: 302

Answers (4)

DGH

Reputation: 11549

(Sorry if this is not expressed clearly, it makes sense in my head).

When adding up all the numbers from 1 to N, you can notice that N + 1 = (N - 1) + 2 = (N - 2) + 3, etc.

So, what if each thread used one small i and one large i, such that the sums always added up to the same total?

Or, say you wanted to always use 5 threads. Thread 1 would do the first 10% and the last 10%, thread 2 would do the second 10% and second-to-last 10%, and so on. Each pairing of an 'early' and 'late' section would add up to the same total number of interactions.

EDIT:

Stealing a diagram from another post...

   0 1 2 3 4 5 6 7 8

0  - A B C D D C B A
1    - B C D D C B A  
2      - C D D C B A
3        - D D C B A  
4          - D C B A
5            - C B A
6              - B A
7                - A
8                  -

Does that show more clearly what I mean?

Upvotes: 3

Jorge Leitao

Reputation: 20163

Today I just found the solution. I'm not accepting it until someone confirms it.

In order for f[a_k] to be constant with respect to k,

f[a_{k+1}] - f[a_k] = 0

must be true for k = 1,2,3,...,p-1.

We can expand this equation using the definitions I've posted in the question, and we arrive at a system of p second-order algebraic equations in a_k, k = 1,2,3,...,p. I don't see a closed solution for arbitrary p, but it can be solved analytically for each p.

I've confirmed that:

  1. the sum, when using the a_k I've calculated, was n(n-1)/2, the total number of interactions of this problem;

  2. the number of interactions per thread is indeed constant for p = 2, 3, 4, 5 and 10 (where p = 10 took some time to calculate in Mathematica®).

EDIT

After detailed inspection of the solutions for different values of p, I've arrived at the general closed-form solution

a_k = 1/(2 p) (-p + 2 p n - sqrt[p^2 + 4 p (p + 1 - k) (n - 1) n])

which is valid for every p>=2, n>1.

This completes the answer.

Upvotes: 0

CAFxX

Reputation: 30331

Supposing your compiler supports OpenMP, why not simply try

#pragma omp parallel for schedule(dynamic) // or: schedule(guided)
for (i = 0; i < n; i++)
    for (j = i+1; j < n; j++)
        // calculate interaction

or even (you'll need to benchmark to understand which one performs better)

#pragma omp parallel
{
    // braces are needed so the whole loop runs inside the parallel region,
    // and the stride must equal the number of threads (a stride of
    // num_threads + 1 would leave some rows unvisited)
    const int stride = omp_get_num_threads();
    for (int i = omp_get_thread_num(); i < n; i += stride)
        for (int j = i+1; j < n; j++)
            // calculate interaction
}

Upvotes: 2

comingstorm

Reputation: 26117

You can divide your objects into k groups of roughly N/k bodies, and use this to dissect your initial triangle of interactions into k*(k + 1)/2 pieces:

   0 1 2 3 4 5 6 7 8
                      -- N=9;  k=3;  N/k=3
0  - A A B B B C C C
1    - A B B B C C C  -- diagonal pieces:  A, D, F
2      - B B B C C C
3        - D D E E E  -- non-diagonal pieces: B, C, E
4          - D E E E
5            - E E E
6              - F F
7                - F
8                  -

This view is complicated by the fact that there are two kinds of pieces: those along the diagonal (which are triangles with (N/k)*(N/k - 1)/2 elements), and those which are not (which are squares with (N/k)*(N/k) elements). However, since the diagonal pieces are roughly half the size of the square pieces, you can assign two to each thread to balance the load -- for a total of k*k/2 roughly-equal tasks.

An advantage of this method is that each task only needs to access the data for 2*N/k bodies, which could make it significantly more cache-friendly.

Upvotes: 3
