Dmobb Jr.

Reputation: 149

OpenMP can't create threads automatically

I am trying to learn how to use OpenMP for multithreading. Here is my code:

#include <iostream>
#include <math.h>
#include <omp.h>
//#include <time.h>
//#include <cstdlib>

using namespace std;

bool isprime(long long num);

int main()
{
        cout << "There are " << omp_get_num_procs() << " cores." << endl;
        cout << 2 << endl;
        //clock_t start = clock();
        //clock_t current = start;
        #pragma omp parallel num_threads(6)
        {
        #pragma omp for schedule(dynamic, 1000)
        for(long long i = 3LL; i <= 1000000000000; i = i + 2LL)
        {
                /*if((current - start)/CLOCKS_PER_SEC > 60)
                {
                        exit(0);
                }*/
                if(isprime(i))
                {
                        cout << i << " Thread: " << omp_get_thread_num() << endl;
                }
        }
        }
}

bool isprime(long long num)
{
        if(num == 1)
        {
                return 0;
        }
        for(long long i = 2LL; i <= sqrt(num); i++)
        {
                if (num % i == 0)
                {
                        return 0;
                }
        }
        return 1;
}

The problem is that I want OpenMP to automatically create a number of threads based on how many cores are available. If I take out num_threads(6), it uses just 1 thread, even though omp_get_num_procs() correctly outputs 64.

How do I get this to work?

Upvotes: 0

Views: 1278

Answers (4)

user2088790

The only problem I see with your code is the output: you need to put it in a critical section, otherwise multiple threads can write to the same line at the same time.
See my code corrections.

As for seeing only one thread, I think that might be due to using dynamic scheduling. A thread running over small numbers is much quicker than one running over large numbers. When the thread with small numbers finishes and gets another chunk of small numbers, it finishes quickly again while the thread with large numbers is still running. This does not mean you're only running one thread, though. In my output I see long streams of the same thread finding primes, but eventually others report as well. You have also set the chunk size to 1000, so if, for example, the loop only had 1000 iterations, only one thread would be used in the loop.
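To make the chunk-size point concrete, here is a rough sketch of the arithmetic. The helper names chunk_count and max_busy_threads are made up for illustration and are not part of OpenMP; the result is only an upper bound, since the runtime decides which threads actually pick up chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of chunks a loop with `iterations` iterations
// is split into under schedule(dynamic, chunk).
std::int64_t chunk_count(std::int64_t iterations, std::int64_t chunk) {
    assert(iterations >= 0 && chunk > 0);
    return (iterations + chunk - 1) / chunk;  // ceiling division
}

// Hypothetical helper: upper bound on how many threads can receive any work.
// A thread without a chunk to grab simply idles at the end of the loop.
std::int64_t max_busy_threads(std::int64_t iterations, std::int64_t chunk,
                              std::int64_t threads) {
    assert(threads > 0);
    return std::min(threads, chunk_count(iterations, chunk));
}
```

With 1000 iterations and a chunk size of 1000 there is a single chunk, so max_busy_threads(1000, 1000, 6) is 1. The loop in the question has roughly 5 * 10^11 iterations, so there the chunk size does not limit the thread count.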

It looks to me like you're trying to find a list of primes or a sum of the number of primes. You're using trial division for that. That's much less efficient than using the "Sieve of Eratosthenes".

Here is an example of the Sieve of Eratosthenes with OpenMP which finds the primes in the first billion numbers in less than one second on my 4-core system: http://create.stephan-brumme.com/eratosthenes/
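For reference, the basic idea of the sieve looks something like the sketch below. This is a plain serial version written for illustration, not the code from the linked page, and it has none of that page's optimizations; the function name primes_up_to is my own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal serial Sieve of Eratosthenes: returns all primes <= limit.
// Memory use grows linearly with limit, so this is meant for modest limits.
std::vector<long long> primes_up_to(long long limit) {
    assert(limit >= 0);
    std::vector<long long> primes;
    if (limit < 2) return primes;

    std::vector<bool> composite(static_cast<std::size_t>(limit) + 1, false);
    for (long long p = 2; p <= limit; ++p) {
        if (composite[static_cast<std::size_t>(p)]) continue;
        primes.push_back(p);
        // Start at p*p: smaller multiples of p were already crossed off
        // by smaller primes.
        for (long long m = p * p; m <= limit; m += p)
            composite[static_cast<std::size_t>(m)] = true;
    }
    return primes;
}
```

Instead of testing each candidate by division, the sieve crosses off multiples of every prime it finds, which is why it does so much less work than calling isprime() on each number.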

I cleaned up your code a bit but did not try to optimize anything since the algorithm is inefficient anyway.

int main() {
    //long long int n = 1000000000000;
    long long int n = 1000000;
    cout << "There are " << omp_get_num_procs() << " cores." << endl;
    double dtime = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)
        for(long long i = 3LL; i <= n; i = i + 2LL) {
            if(isprime(i)) {
                #pragma omp critical 
                {
                    cout << i << "\tThread: " << omp_get_thread_num() << endl;
                }
            }
        }
    }
    dtime = omp_get_wtime() - dtime;
    cout << "time " << dtime << endl;
}

Upvotes: 0

Tom Scogland

Reputation: 938

You neglected to mention which compiler and OpenMP implementation you are using. I'm going to guess you're using one, like PGI, that does not automatically choose the number of threads to create in a default parallel region unless asked to do so. Since you did not specify the compiler, I cannot be certain that these options will actually help you, but for PGI's compilers the necessary option is -mp=allcores when compiling and linking the executable. With that option, the runtime creates one thread per core for parallel regions that do not specify the number of threads and for which the appropriate environment variable is not set.

The number you're getting from omp_get_num_procs is used by default to set the limit on the number of threads, but not necessarily the number created. If you want to dynamically set the number created, set the environment variable OMP_NUM_THREADS to the desired number before running your application and it should behave as expected.
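A quick way to see the difference between the number of processors and the number of threads actually created is to query both. The sketch below uses only standard OpenMP runtime calls (omp_get_num_procs, omp_get_max_threads, omp_get_num_threads); the ThreadInfo struct and query_thread_info function are names I made up, and the _OPENMP guard is there only so the file still builds when OpenMP is not enabled.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical container for the three numbers worth comparing.
struct ThreadInfo {
    int procs;        // omp_get_num_procs(): processors available
    int max_threads;  // omp_get_max_threads(): upper bound for a default region
    int team_size;    // omp_get_num_threads(), measured inside a parallel region
};

ThreadInfo query_thread_info() {
    ThreadInfo info{1, 1, 1};  // values used when OpenMP is not enabled
#ifdef _OPENMP
    info.procs = omp_get_num_procs();
    info.max_threads = omp_get_max_threads();
    #pragma omp parallel
    {
        // Only one thread writes the result, so there is no data race.
        #pragma omp single
        info.team_size = omp_get_num_threads();
    }
#endif
    assert(info.team_size >= 1);
    return info;
}
```

If team_size comes back as 1 while procs is 64, the runtime is defaulting to a single thread, and setting OMP_NUM_THREADS before running should change both max_threads and team_size.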

Upvotes: 1

Jerry Coffin

Reputation: 490623

Unless I'm rather badly mistaken, OpenMP normally serializes I/O (at least to a single stream) so that's probably at least part of where your problem is arising. Removing that from the loop, and massaging a bit of the rest (not much point in working at parallelizing until you have reasonably efficient serial code), I end up with something like this:

#include <iostream>
#include <math.h>
#include <omp.h>

using namespace std;

bool isprime(long long num);

int main()
{
    unsigned long long total = 0;

    cout << "There are " << omp_get_num_procs() << " cores.\n";

    #pragma omp parallel for reduction(+:total)
    for(long long i = 3LL; i < 100000000; i += 2LL)
        if(isprime(i))
            total += i;

    cout << "Total: " << total << "\n";
}

bool isprime(long long num) {
    if (num == 2)
        return 1;
    if(num == 1 || num % 2 == 0)
        return 0;
    unsigned long long limit = sqrt(num);

    for(long long i = 3LL; i <= limit; i+=2)
        if (num % i == 0)
            return 0;
    return 1;
}

This doesn't print out the thread number, but timing it I get something like this:

Real    78.0686
User    489.781
Sys     0.125

Note that the "User" time is more than 6x as large as the "Real" time, indicating that the load is being distributed across the 8 cores available on this machine with about 80% efficiency. With a little more work, you might be able to improve that further, but even with this simple version we're seeing considerably more than one core being used (on your 64-core machine, we should see at least a 50:1 improvement over single-threaded code, and probably quite a bit better than that).
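The arithmetic behind that estimate can be written out as below. The function names are made up for illustration, and treating User/Real as the average number of busy cores is a rough approximation that ignores system time.

```cpp
#include <cassert>

// Hypothetical helper: average number of cores kept busy, estimated as
// CPU ("User") time divided by wall-clock ("Real") time.
double average_busy_cores(double user_seconds, double real_seconds) {
    assert(real_seconds > 0.0);
    return user_seconds / real_seconds;
}

// Hypothetical helper: fraction of the available cores that were kept busy.
double parallel_efficiency(double user_seconds, double real_seconds, int cores) {
    assert(cores > 0);
    return average_busy_cores(user_seconds, real_seconds) / cores;
}
```

For the timings above, 489.781 / 78.0686 is roughly 6.27 busy cores, and dividing by 8 gives an efficiency of roughly 0.78.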

Upvotes: 0

Simon Righley

Reputation: 4969

I'm not sure if I understand your question correctly, but it seems that you are almost there. Do you mean something like:

#include <omp.h>
#include <iostream>

int main(){

    const int num_procs = omp_get_num_procs();
    std::cout<<num_procs;

#pragma omp parallel for num_threads(num_procs) default(none)
    for(int i=0; i<1000000; ++i){ // was (int)1E20, which does not fit in an int
    }

    return 0;

}

Upvotes: 0
