Reputation: 3496
I've been writing a lot recently about parallel computing and programming, and I notice that a lot of patterns keep coming up. Noting that Microsoft has already released a library (the Parallel Patterns Library) along with the Microsoft Visual C++ 2010 Community Technology Preview, I'm wondering: what common parallel programming patterns have you been using and encountering that may be worth remembering? Do you have any idioms you follow, or patterns that seem to keep popping up as you write parallel programs with C++?
Upvotes: 8
Views: 7175
Reputation: 1688
You have the basics that bolt parallelism onto parts of a program. C++17 is getting many of them (e.g. parallel versions of for_each, sort, find and friends, map_reduce, map, reduce, prefix_sum, ...); see C++ Extensions for Parallelism.
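For illustration, a minimal C++17 sketch of those parallel algorithm overloads (this assumes a standard library that actually implements the parallel execution policies; with libstdc++ you typically also have to link against TBB):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main( )
{
    std::vector<int> v( 1'000'000 );
    std::iota( v.begin( ), v.end( ), 0 );

    // Parallel overloads of the classic algorithms: just pass an execution policy.
    std::sort( std::execution::par, v.begin( ), v.end( ) );

    // transform_reduce is the standard library's map_reduce.
    long long sum_of_squares = std::transform_reduce(
        std::execution::par, v.begin( ), v.end( ), 0LL, std::plus<>( ),
        []( int x ) { return static_cast<long long>( x ) * x; } );
    (void)sum_of_squares;
}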
Then you have items like continuations. Think std::future, but able to continue on into the next task. There are a few ways of implementing these (boost has a good one now, as the std one does not have a next(...) or then(...) method), but the big benefit is that one does not have to wait on one task before starting the next:
auto fut = async( []( ) { ...some work... } ).then( []( result_of_prev ) { ...more work... } ).then ... ;
fut.wait( );
The lack of synchronisation between the subsequent tasks is important as communication between tasks/threads/... is what slows down parallel programs.
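Since the std one lacks .then( ), here is a minimal sketch that emulates a continuation with plain std::async by handing the previous future to the next task; only the final get( ) blocks the calling thread:

#include <future>
#include <iostream>

int main( )
{
    // Stage 1: produce a value asynchronously.
    std::future<int> stage1 = std::async( std::launch::async, []( ) { return 6 * 7; } );

    // Stage 2: the continuation owns the previous future and blocks on it inside
    // its own task, so the calling thread never waits here (this is what a real
    // .then( ) does for you).
    std::future<int> stage2 = std::async( std::launch::async,
        [f = std::move( stage1 )]( ) mutable { return f.get( ) + 1; } );

    // ...the calling thread is free to do other work...

    std::cout << stage2.get( ) << '\n';   // the only blocking point
}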
With that, task-based parallelism is really nice. With a task scheduler you just pass tasks off and walk away. They may have mechanisms, like a semaphore, to communicate back, but that is not mandatory. Both Intel Threading Building Blocks and Microsoft's Parallel Patterns Library have facilities for this.
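A sketch of the "pass tasks off and walk away" style, assuming Intel TBB is available (PPL's concurrency::task_group looks almost identical):

#include <atomic>
#include <tbb/task_group.h>

int main( )
{
    std::atomic<int> done{ 0 };
    tbb::task_group tasks;

    // Hand tasks to the scheduler and walk away; nothing is waited on here.
    tasks.run( [&]( ) { ++done; /* ...task A... */ } );
    tasks.run( [&]( ) { ++done; /* ...task B... */ } );

    // ...this thread is free to do unrelated work in the meantime...

    tasks.wait( );   // the one optional synchronisation point
}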
After that we have the fork/join pattern. It does not imply creating N threads, one for each task; just that you have N, ideally independent, things to do (fork) and, when they are done, a synchronisation point somewhere (join).
auto semaphore = make_semaphore( num_tasks );
add_task( [&semaphore]( ) {...task1...; semaphore.notify( ); } );
add_task( [&semaphore]( ) {...task2...; semaphore.notify( ); } );
...
add_task( [&semaphore]( ) {...taskN...; semaphore.notify( ); } );
semaphore.wait( );
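make_semaphore and add_task above are placeholders; a runnable C++20 equivalent of the same fork/join shape, using std::latch and std::jthread, might look like this:

#include <latch>
#include <thread>
#include <vector>

int main( )
{
    constexpr int num_tasks = 4;
    std::latch done( num_tasks );

    std::vector<std::jthread> workers;
    for( int i = 0; i < num_tasks; ++i )
        workers.emplace_back( [&done]( ) {
            // ...one of the N independent tasks...
            done.count_down( );          // plays the role of semaphore.notify( )
        } );

    done.wait( );                        // the join point: all N tasks finished
}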
From the above you can start to see the pattern: this is a flow graph. A future chain is (A >> B >> C >> D) and fork/join is (A|B|C|D). With that you can combine them to form a graph, e.g. (A1>>A2 | B1>>B2>>B3 | C1 | D1>>D2>>(E1>>E2|F1)), where A1>>A2 means A1 must precede A2 and A|B means that A and B can run concurrently. The slow parts are at the ends of the graphs/subgraphs, where things meet up.
The goal is to find independent parts of the system that do not need to communicate. The parallel algorithms, as noted above, are in almost all cases slower than their sequential counterparts until the workload or the data size gets big enough (assuming communication isn't too chatty). Take sorting: on a 4-core computer you will get around 2.5x the performance, give or take, because the merge phase is chatty, requires a lot of synchronisation, and does not keep all the cores busy after the first merge round. On a GPU with a very large N, one can use a less efficient sort, like bitonic sort, and it ends up being very fast because you have a lot of workers to power through the work while everyone quietly does their own thing.
One trick to reduce communication is to use an array for results, so that each task isn't trying to lock a shared object in order to push a value; the later reduction of these results can often be very quick.
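A minimal sketch of that trick: each task writes only into its own slot, and a cheap sequential reduction happens at the end (for hot loops you may also want to pad the slots to cache-line size to avoid false sharing):

#include <numeric>
#include <thread>
#include <vector>

int main( )
{
    constexpr int num_tasks = 4;
    std::vector<long long> partial( num_tasks, 0 );   // one slot per task, no lock needed

    {
        std::vector<std::jthread> workers;
        for( int i = 0; i < num_tasks; ++i )
            workers.emplace_back( [&partial, i]( ) {
                // Each task writes only to its own slot: no sharing, no locking.
                for( int n = i; n < 1'000'000; n += num_tasks )
                    partial[i] += n;
            } );
    }   // the jthreads join here

    // Cheap sequential reduction of the per-task results.
    long long total = std::accumulate( partial.begin( ), partial.end( ), 0LL );
    (void)total;
}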
But with all types of parallelism the slowness comes from communication. Reduce it.
Upvotes: 1
Reputation: 21
Parallel Execution Patterns
Structured parallel programming with deterministic patterns is a high-level approach mainly based on a collection of recurrent parallel execution patterns, often referred to as algorithmic skeletons or parallel constructs, which abstract the program description and hide low-level multithreading details and many of the complexities inherent in parallelism from the programmer.
These reusable patterns automate many parallelism-related routines, such as synchronization, communication, data partitioning and task scheduling, and handle them internally. This high-level approach replaces the traditional low-level thread-and-lock model with more abstraction and an easier way to express parallelism, and focuses on productivity and programmability rather than raw performance.
There are many commonly used patterns, such as Map-Reduce, Fork-Join, Pipeline or Parallel Loop.
Papers
"Structured Parallel Programming with Deterministic Patterns" is a paper which discuss these patterns. You can see also "MHPM: Multi-Scale Hybrid Programming Model: A Flexible Parallelization Methodology" which describe a C++ implementation of this approach named XPU.
Library
XPU is a task-based C++ library composed of a set of reusable execution patterns. It allows several types of parallelism to be expressed at several levels of granularity inside a single homogeneous programming model. It's easy to use and illustrates the interest of using patterns to design parallel programs.
For example, it allows expression of:
Task Parallelism Pattern:
Simple or Hierarchical Fork/Join execution pattern with some features such as automatic detection and protection of shared data.
Data Parallelism Pattern:
Parallel loop pattern with scalable data partitioning.
Temporal Parallelism Pattern:
Pipeline execution pattern.
Upvotes: 2
Reputation: 264621
Patterns:
Producer/Consumer
Loop parallelism
Re-Draw Thread
Main-Event Thread
Work Group
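For the first of those, a bare-bones producer/consumer sketch using only the standard library (real code would usually reach for a ready-made concurrent queue, e.g. from TBB or PPL):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

int main( )
{
    std::queue<int> work;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Producer: pushes work items into the shared queue.
    std::jthread producer( [&]( ) {
        for( int i = 0; i < 100; ++i ) {
            { std::lock_guard lock( m ); work.push( i ); }
            cv.notify_one( );
        }
        { std::lock_guard lock( m ); done = true; }
        cv.notify_one( );
    } );

    // Consumer: pops and processes items, sleeping on the condition variable
    // instead of busy-waiting.
    std::jthread consumer( [&]( ) {
        for( ;; ) {
            std::unique_lock lock( m );
            cv.wait( lock, [&]( ) { return !work.empty( ) || done; } );
            if( work.empty( ) ) return;          // producer finished, queue drained
            int item = work.front( ); work.pop( );
            lock.unlock( );
            (void)item;                          // ...process item...
        }
    } );
}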
Upvotes: 17
Reputation: 127527
First you have to choose between shared-memory computing and shared-nothing computing. Shared memory is easier, but doesn't scale that well - you will use shared-nothing if you either
a) have a cluster, rather than a multiprocessor system, or
b) have many CPUs (say, > 60) and a highly non-uniform memory architecture
For shared-memory, the common solution is to use threads; they are easy to understand as a concept, and easy to use in the API (but difficult to debug).
For shared-nothing, you use some kind of messaging. In high-performance computing, MPI is established as the messaging middleware.
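A minimal MPI sketch of that shared-nothing style (compile with mpic++, run with mpirun): each rank computes a partial result on its own slice of the data and only messages are exchanged:

#include <mpi.h>
#include <cstdio>

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    int rank = 0, size = 1;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    // Each process works on its own slice of the data - nothing is shared.
    long long partial = 0;
    for( long long n = rank; n < 1'000'000; n += size )
        partial += n;

    // The only communication: combine the partial sums on rank 0.
    long long total = 0;
    MPI_Reduce( &partial, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD );

    if( rank == 0 )
        std::printf( "total = %lld\n", total );

    MPI_Finalize( );
}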
You then also need to design an architecture for the parallel activities. The most common approach (again, because it's easy to understand) is the farmer-worker pattern (a.k.a. master-slave).
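A bare-bones farmer-worker sketch with shared-memory threads, where the "farmer" is simply an atomic counter of work items that each worker pulls from (in a shared-nothing setting the farmer would instead send work items out as messages):

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

int main( )
{
    constexpr int num_items = 1000;
    std::atomic<int> next_item{ 0 };

    unsigned n_workers = std::max( 1u, std::thread::hardware_concurrency( ) );
    std::vector<std::jthread> workers;
    for( unsigned w = 0; w < n_workers; ++w )
        workers.emplace_back( [&]( ) {
            for( ;; ) {
                int item = next_item.fetch_add( 1 );   // claim the next work item
                if( item >= num_items ) return;        // no work left
                // ...process work item `item`...
            }
        } );
}   // the jthreads join here: every item has been processed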
Upvotes: 2