
Reputation: 41290

Best Practices for cache locality in Multicore Parallelism in F#

I'm studying multicore parallelism in F#. I have to admit that immutability really helps to write correct parallel implementations. However, it's hard to achieve good speedup and good scalability when the number of cores grows. For example, my experience with the Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way, using List or Array as the representation, have failed. Profiling those implementations shows that the number of cache misses increases significantly compared to the sequential versions. However, if one implements parallel Quick Sort using mutation inside arrays, a good speedup can be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
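To make the mutable approach concrete, here is roughly the kind of implementation I mean: a minimal sketch (the helper names and the depth cutoff are illustrative, not a tuned implementation) that partitions an array in place and only spawns parallel work down to a fixed depth, so the number of concurrent tasks stays close to the core count.

```fsharp
open System.Threading.Tasks

// Lomuto partition over a.[lo..hi]; returns the final pivot position.
let partition (a: 'T[]) lo hi =
    let pivot = a.[hi]
    let mutable i = lo - 1
    for j in lo .. hi - 1 do
        if a.[j] <= pivot then
            i <- i + 1
            let t = a.[i] in a.[i] <- a.[j]; a.[j] <- t
    let t = a.[i + 1] in a.[i + 1] <- a.[hi]; a.[hi] <- t
    i + 1

// Sorts a.[lo..hi] in place; recurses in parallel only while depth > 0.
let rec parQuickSort (a: 'T[]) lo hi depth =
    if lo < hi then
        let p = partition a lo hi
        if depth > 0 then
            Parallel.Invoke(
                (fun () -> parQuickSort a lo (p - 1) (depth - 1)),
                (fun () -> parQuickSort a (p + 1) hi (depth - 1)))
        else
            parQuickSort a lo (p - 1) 0
            parQuickSort a (p + 1) hi 0

// Usage: parQuickSort data 0 (data.Length - 1) 3   // at most ~2^3 concurrent partitions
```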

I believe that cache locality is a big obstacle for multicore parallelism in a functional language. Functional programming involves creating many short-lived objects, and allocating and collecting those objects can hurt the locality of CPU caches. I have seen many suggestions for improving cache locality in imperative languages, for example, here and here. But it's not clear to me how they would be applied in functional programming, especially with recursive data structures such as trees, which appear quite often.

Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advice or code examples are more than welcome.

Upvotes: 26

Views: 3423

Answers (6)

Adrian

Reputation: 2364

As far as I can make out, the key to cache locality (multithreaded or otherwise) is

  • Keep work units in a contiguous block of RAM that will fit into the cache

To this end:

  • Avoid objects where possible
    • Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
    • You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
  • Use arrays. On .NET, an array is a single contiguous block of memory.
    • Other collection datatypes might distribute things all over the place - linked lists, for example, are composed of pointers.
  • Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
  • Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory and, as value types, are stored inline in the array just like primitives (see the F# sketch after this list).
  • Work out the size of the cache on the machine you'll be executing it on
    • CPUs have different size L2 caches
    • It might be prudent to design your code to scale with different cache sizes
    • Or more simply, write code that will fit inside the lowest common cache size your code will be running on
  • Work out what needs to sit close to each datum
    • In practice, you're not going to fit your whole working set into the L2 cache
    • Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.

In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right, computers aren't theoretically perfect examples of computer science either.
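As a rough F# illustration of the arrays-of-structs point above (a sketch: the Particle type and the sizes are made up, and struct records assume F# 4.1+):

```fsharp
// A struct record is stored inline in the array, so the whole data set sits
// in one contiguous block instead of being an array of pointers into the heap.
[<Struct>]
type Particle =
    { X: float; Y: float; VX: float; VY: float }

// 100,000 particles * 32 bytes each = ~3.2 MB in a single contiguous allocation.
let particles =
    Array.init 100000 (fun _ -> { X = 0.0; Y = 0.0; VX = 1.0; VY = 1.0 })

// Array.Parallel gives each core a contiguous range of the same block,
// so every core streams through memory sequentially.
let step dt =
    Array.Parallel.iteri
        (fun i p -> particles.[i] <- { p with X = p.X + p.VX * dt; Y = p.Y + p.VY * dt })
        particles
```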

A good academic paper on the subject is Cache-Efficient String Sorting Using Copying

Upvotes: 26

7sharp9

Reputation: 2167

A great approach is to split the work into smaller sections and iterate over each section on each core.

One option I would start with is to look for cache locality improvements on a single core before going parallel; after that, it should simply be a matter of subdividing the work again for each core. For example, if you are doing calculations with large matrices, you could split the calculations into smaller sections.

Here's a great example of that: Cache Locality For Performance
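As a rough sketch of that subdivision idea in F# (the flat row-major layout and the blockSize parameter are illustrative assumptions, not taken from the linked article): split an n×n multiplication into tiles that fit in cache, and give each core whole row tiles to work on.

```fsharp
open System.Threading.Tasks

// C <- C + A * B for n*n matrices stored as flat, row-major float arrays.
// blockSize is a tuning knob: pick it so one tile of A, B and C fits in cache.
let blockedMultiply (a: float[]) (b: float[]) (c: float[]) n blockSize =
    let rowTiles = (n + blockSize - 1) / blockSize
    // One parallel task per row tile: each core reads and writes a contiguous
    // band of A and C instead of striding over the whole matrix.
    let processTile bi =
        let i0 = bi * blockSize
        let iMax = min (i0 + blockSize) n
        for k0 in 0 .. blockSize .. n - 1 do
            for j0 in 0 .. blockSize .. n - 1 do
                for i in i0 .. iMax - 1 do
                    for k in k0 .. min (k0 + blockSize) n - 1 do
                        let aik = a.[i * n + k]
                        for j in j0 .. min (j0 + blockSize) n - 1 do
                            c.[i * n + j] <- c.[i * n + j] + aik * b.[k * n + j]
    Parallel.For(0, rowTiles, fun bi -> processTile bi) |> ignore
```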

There are some great sections in Tomas Petricek's book Real-World Functional Programming; check out Chapter 14, Writing Parallel Functional Programs. You might find the section on parallel processing of a binary tree of particular interest.

Upvotes: 3

Alois Kraus

Reputation: 13545

To write scalable apps, cache locality is paramount for application speed. The principles are well explained in Scott Meyers' talk. Immutability does not play well with cache locality, since you create new objects in memory, which forces the CPU to reload the data from the new object. As noted in the talk, even on modern CPUs the L1 cache is only 32 KB per core, and it has to hold both code and data. If you go multi-threaded, you should try to consume as little memory as possible (goodbye immutability) to stay in the fastest cache. The L2 cache is about 4-8 MB, which is much bigger but still tiny compared to the data you are trying to sort.

If you manage to write an application which consumes as little memory as possible (data cache locality), you can get speedups of 20 or more. But even if you manage this for one core, it may very well be that scaling to more cores will hurt performance, since all cores are competing for the same L2 cache.

To get the most out of this, C++ developers use PGO (Profile Guided Optimization), which lets them profile their application and feed the profile data to the compiler so it can emit better-optimized code for the specific use case.

You can improve things to a certain extent in managed code, but since so many factors influence your cache locality, it is not likely that you will ever see a speedup of 20 in the real world from cache locality alone. That remains the domain of C++ and of compilers that use profiling data.

Upvotes: 2

Joh

Reputation: 2380

I am no parallelism expert, but here is my advice anyway.

  1. I would expect that a locally mutable approach, where each core is allocated an area of memory that it both reads and writes, will always beat a pure approach.
  2. Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replacing references with indices before processing (see the sketch below). Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
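A minimal sketch of point 2, assuming a simple binary tree (the Tree and FlatTree types and their field names are made up for illustration): flatten the nodes into contiguous arrays and replace child references with integer indices.

```fsharp
type Tree =
    | Leaf of int
    | Node of int * Tree * Tree

// Flat form: Values.[i] is node i's payload, Left.[i] / Right.[i] are the
// indices of its children (-1 means "no child"). Everything is contiguous.
type FlatTree =
    { Values: int[]; Left: int[]; Right: int[] }

let flatten (root: Tree) =
    let values = ResizeArray<int>()
    let left   = ResizeArray<int>()
    let right  = ResizeArray<int>()
    let rec go t =
        let i = values.Count
        match t with
        | Leaf v ->
            values.Add(v); left.Add(-1); right.Add(-1)
            i
        | Node (v, l, r) ->
            values.Add(v); left.Add(-1); right.Add(-1)   // reserve slot i first
            left.[i] <- go l
            right.[i] <- go r
            i
    go root |> ignore
    { Values = values.ToArray(); Left = left.ToArray(); Right = right.ToArray() }
```

Traversals over the flat form become loops over int arrays rather than pointer chasing, which also keeps the garbage collector out of the hot path.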

Upvotes: 3

GregC

Reputation: 8007

Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. A purely functional style often yields a more intuitive implementation, and hence is preferred.

Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion about performance concrete: choose a processor, then benchmark it with a specific algorithm.

To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in the general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when the number of reads outweighs the number of writes.

Upvotes: 3
