OpenMP and OOP (Molecular Dynamics Simulation)

Question

I’m conducting a molecular dynamics simulation, and I’ve been struggling for quite a while to implement it in parallel, and although I succeeded in fully loading my 4-thread processor, the computation time in parallel is greater than the computation time in serial mode.

Studying at which point of time each thread starts and finishes its loop iteration, I’ve noticed a pattern: it’s as if different threads are waiting for each other. It was then that I turned my attention to the structure of my program. I have a class, an instance of which represents my system of particles, containing all the information about particles and some functions that use this information. I also have a class instance of which represents my interatomic potential, containing parameters of potential function along with some functions (one of those functions calculates force between two given particles).

And so in my program there exist instances of two different classes, and they interact with each other: some functions of one class take references to instances of another class. And the block I’m trying to implement in parallel looks like this:

      void Run_simulation(Class_system &system, Class_potential &potential, some other arguments){
          #pragma omp parallel for
              for(…) 
      }

for(...) is the actual computation, using data from the system instance of the Class_system class and some functions from thepotential instance of the Class_potential class.

Am I right that it’s this structure that’s the source of my troubles?

Could you suggest me what has to be done in this case? Must I rewrite my program in completely different manner? Should I use some different tool to implement my program in parallel?

Hristo Iliev · Accepted Answer

Without further details on your simulation type I can only speculate, so here are my speculations.

Did you look into the issue of load balancing? I guess the loop distributes the particles among threads but if you have some kind of a restricted range potential, then the computational time might differ from particle to particle in the different regions of the simulation volume, depending on the spatial density. This is a very common problem in molecular dynamics and one that is very hard to solve properly in distributed memory (MPI in most cases) codes. Fortunately with OpenMP you get direct access to all particles at each computing element and so the load balancing is much easier to achieve. It is not only easier, but it is also built-in, so to speak - simply change the scheduling of the for directive with the schedule(dynamic,chunk) clause, where chunk is a small number whose optimal value might vary from simulation to simulation. You might make chunk part of the input data to the program or you might instead write schedule(runtime) and then play with different scheduling classes by setting the OMP_SCHEDULE environment variable to values like "static", "dynamic,1", "dynamic,10", "guided", etc.

Another possible source of performance degradation is false sharing and true sharing. False sharing occurs when your data structure is not suitable for concurrent modification. For example, if you keep 3D positional and velocity information for each particle (let's say you use velocity Verlet integrator), given IEEE 754 double precision, each coordinate/velocity triplet takes 24 bytes. This means that a single cache line of 64 bytes accommodates 2 complete triplets and 2/3 of another one. The consequence of this is that no matter how you distribute the particles among the threads, there would always be at lest two threads that would have to share a cache line. Suppose that those threads run on different physical cores. If one thread writes to its copy of the cache line (for example it updates the position of a particle), the cache coherency protocol would be involved and it will invalidate the cache line in the other thread, which would then have to reread it from a slower cache of even from main memory. When the second thread update its particle, this would invalidate the cache line in the first core. The solution to this problem comes with proper padding and proper chunk size choice so that no two threads would share a single cache line. For example, if you add a superficial 4-th dimension (you can use it to store the potential energy of the particle in the 4-th element of the position vector and the kinetic energy in the 4-th element of the velocity vector) then each position/velocity quadruplet would take 32 bytes and information for exactly two particles would fit in a single cache line. If you then distribute an even number of particles per thread, you automatically get rid of possible false sharing.

True sharing occurs when threads access concurrently the same data structure and there is an overlap between the parts of the structure, modified by the different threads. In molecular dynamics simulations this occurs very frequently as we want to exploit the Newton's third law in order to cut the computational time in two when dealing with pairwise interaction potentials. When one thread computes the force acting on particle i, while enumerating its neighbours j, computing the force that j exerts on i automatically gives you the force that i exerts on j so that contribution can be added to the total force on j. But j might belong to another thread that might be modifying it at the same time, so atomic operations have to be used for both updates (both, sice another thread might update i if it happens to neighbour one of more of its own particles). Atomic updates on x86 are implemented with locked instructions. This is not that horribly slow as often presented, but still slower than a regular update. It also includes the same cache line invalidation effect as with false sharing. To get around this, at the expense of increased memory usage one could use local arrays to store partial force contributions and then perform a reduction in the end. The reduction itself has to either be performed in serial or in parallel with locked instructions, so it might turn out that not only there is no gain from using this approach, but rather it could be even slower. Proper particles sorting and clever distribution between the processing elements so to minimise the interface regions can be used to tackle this problem.

One more thing that I would like to touch is the memory bandwidth. Depending on your algorithm, there is a certain ratio between the number of data elements fetched and the number of floating point operations performed at each iteration of the loop. Each processor has only a limited bandwidth available for memory fetches and if it happens that your data does not quite fit in the CPU cache, then it might happen that the memory bus is unable to deliver enough data to feed so many threads executing on a single socket. Your Core i3-2370M has only 3 MiB of L3 cache so if you explicitly keep the position, velocity and force for each particle, you can only store about 43000 particles in the L3 cache and about 3600 particles in the L2 cache (or about 1800 particles per hyperthread).

The last one is hyperthreading. As High Performance Mark has already noted, hyperthreads share a great deal of core machinery. For example there is only one AVX vector FPU engine that is shared among both hyperthreads. If your code is not vectorised, you lose a great deal of computing power available in your processor. If your code is vectorised, then both hyperthreads will get into each others way as they fight for control over the AVX engine. Hyperthreading is useful only when it is able to hide memory latency by overlaying computation (in one hyperthread) with memory loads (in another hyperthread). With dense numerical codes that perform many register operations before they perform memory load/store, hyperthreading gives no benefits whatsoever and you'd be better running with half the number of threads and explicitly binding them to different cores as to prevent the OS scheduler from running them as hyperthreads. The scheduler on Windows is particularly dumb in this respect, see here for an example rant. Intel's OpenMP implementation supports various binding strategies controlled via environment variables. GNU's OpenMP implementation too. I am not aware of any way to control threads binding (a.k.a. affinity masks) in Microsoft's OpenMP implementation.

OpenMP and OOP (Molecular Dynamics Simulation)

Answers (1)

Related Questions