pushkin

Reputation: 10247

OpenMP's mechanism for spreading threads out evenly

OpenMP tries to spread out threads across the cores as evenly as possible, but how does that work?

Ultimately, the OS decides how to spread them. Does OpenMP simply recommend to the OS to do that (similar to using the likely macro or the register keyword in C)?

If we're running a job with num_threads threads on a machine with num_cores cores, none of which are currently in use, is it fair to assume that the threads will be spread out evenly across all cores (and, assuming num_threads <= num_cores, that you get pure parallelism), since the OS should be working in our best interest and spreading the load nicely?

I see graphs of strong scaling where the x-axis is the number of cores. Do we then assume that the maximum number of threads used to run the job was <= the number of cores and that the cores were otherwise relatively idle?

Or is all of this a moot point?

Upvotes: 0

Views: 281

Answers (1)

Gilles

Reputation: 9519

The scheduling of the OpenMP threads on the cores and/or hardware threads of the machine is mostly the responsibility of the operating system. It will decide based on its own heuristics where and when to start / stop / migrate them...

However, OpenMP gives you some tools to direct or restrict the span of choices the OS has when making its decisions (a minimal sketch putting them together follows the list). For example, you have access to:

  • The number of OpenMP threads to launch for a parallel region: OMP_NUM_THREADS environment variable, num_threads clause, omp_set_num_threads() function
  • The logical cores where the threads can be scheduled by the OS: OMP_PLACES environment variable.
  • The optional pinning policy for the threads: OMP_PROC_BIND environment variable, proc_bind clause.
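
Putting these together in code, here is a minimal sketch of mine (not from the original answer; the specific values are arbitrary). OMP_PLACES and OMP_PROC_BIND would typically be set in the environment before launching, e.g. OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out, and the clauses shown override them for that one region:

    /* Compile with e.g. gcc -fopenmp spread.c */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Same effect as OMP_NUM_THREADS=4 for subsequent parallel regions. */
        omp_set_num_threads(4);

        /* The num_threads and proc_bind clauses override the environment
         * settings for this particular parallel region only. */
        #pragma omp parallel num_threads(4) proc_bind(spread)
        {
            printf("thread %d of %d (machine reports %d logical processors)\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_num_procs());
        }
        return 0;
    }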

With that, you have some level of control to steer the OS's decisions, but ultimately it remains in control of the actual scheduling. The decisions it takes are not always what you would expect (especially when you don't use placement or binding), since the machine's workload and the global scheduling policy it applies might interfere with what you think would be optimal for your code. For example, on a NUMA (Non-Uniform Memory Access) machine, considerations such as the memory already allocated on the various nodes, and which memory segment belongs to which process, might prevent a seemingly even spread of threads across the chips, leading to local CPU contention...
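
If you want to see where the threads actually ended up, you can query the runtime from inside the parallel region. A hedged sketch, assuming an OpenMP 4.5+ runtime (for omp_get_place_num()) and Linux/glibc (for sched_getcpu()):

    #define _GNU_SOURCE   /* needed for sched_getcpu() with glibc */
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* omp_get_place_num() returns the place the thread is bound to,
             * or -1 if it is not bound; sched_getcpu() shows the logical CPU
             * it is running on right now, which the OS may still change
             * unless the thread is pinned. */
            printf("thread %d: place %d, cpu %d\n",
                   omp_get_thread_num(), omp_get_place_num(), sched_getcpu());
        }
        return 0;
    }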

Upvotes: 1
