Reputation: 2989
Folks, I've been programming high-speed software for over 20 years and know virtually every trick in the book: micro-benchmarking, profiling, cooperative user-mode multitasking, tail recursion, you name it, for very high-performance work on Linux, Windows, and more.
The problem is that I find myself befuddled by what happens when multiple threads of CPU-intensive work run on a multi-core processor.
The performance results from micro-benchmarks of various ways of sharing data between threads (on different cores) don't seem to follow logic.
It's clear that there is some "hidden interaction" between the cores which isn't obvious from my own code. I hear of L1 caches and other issues, but those are opaque to me.
The question is: where can I learn this stuff? I am looking for an in-depth book on how multi-core processors work, and how to program to capitalize on their memory caches and other hardware architecture instead of being punished by them.
Any advice or great websites or books? After much Googling, I'm coming up empty.
Sincerely, Wayne
Upvotes: 11
Views: 2897
Reputation: 7733
My own question, with answer, on Stack Overflow's sister site: https://softwareengineering.stackexchange.com/questions/126986/where-can-i-find-an-overview-of-known-multithreading-design-patterns/126993#126993
I will copy the answer to avoid the need for click-through:
Quote Boris:
Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures https://rads.stackoverflow.com/amzn/click/0735651590
This is a book I recommend wholeheartedly.
It is:
New - published last year, which means you are not reading somewhat outdated practices.
Short - about 200 pages, dense with information. These days there is too much to read and too little time to read 1000+ page books.
Easy to read - not only is it very well written, it introduces hard-to-grasp concepts in a really easy-to-read way.
Intended to teach - each chapter gives exercises to do. I know it is always beneficial to do these, but rarely do. This book gives very compelling and interesting tasks. Surprisingly, I did most of them and enjoyed doing them.
Additionally, if you wish to learn more of the low-level details, this is the best resource I have found: "The Art of Multiprocessor Programming". Its code samples are written in Java, which plays nicely with my C# background.
PS: I have about 5 years of "hard-core" parallel programming experience (albeit using C#), so I hope you can trust me when I say that "The Art of Multiprocessor Programming" rocks.
Upvotes: 4
Reputation: 3264
There are different aspects to multi-threading requiring different approaches.
On a webserver, for example, thread pools are widely used since they are supposedly "good for" performance. Such pools may contain hundreds of threads waiting to be put to work. Using that many threads causes the scheduler to work overtime, which is detrimental to performance but can't be avoided on Linux systems. On Windows the method of choice is the IOCP mechanism, which recommends using no more threads than there are cores installed. It makes an application (I/O completion) event-driven, which means that no cycles are wasted on polling. The few threads involved reduce scheduler work to a minimum.
If the object is to implement functionality that is scalable (more cores <=> higher performance), then the main issue will be memory-bus saturation. The saturation occurs due to code fetching, data reading, and data writing. Incorrectly implemented code will run slower with two threads than with one. The only way around this is to actively reduce the work the memory bus has to do.
To put this another way: fit the applicable code and data chunks into as few cache lines (at 64 bytes each) as possible, because ultimately this is what will decide scalability. If the cache/memory system is capable of x cache-line operations every second, your code will run faster if it requires five cache lines per unit of work (=> x/5) rather than eleven (x/11) or fifty-two (x/52).
Achieving this is not trivial since it requires a more or less unique solution every time. Some compilers do a good job of instruction ordering to take advantage of the host processor's pipelining. This does not necessarily mean that it will be a good ordering for multiple cores.
An efficient implementation of scalable code will not necessarily be a pretty one. Recommended coding techniques and styles may, in the end, hinder the code's execution.
My advice is to test how this works by writing a simple multi-threaded application in a low-level language (such as C) that can be adjusted to run in single- or multi-threaded mode, and then profiling the code in each mode. You will need to analyze the code at the instruction level. Then experiment with different (C) code constructs, data organization, etc. You may have to think outside the box and rethink the algorithm to make it more cache-friendly.
The first time will require lots of work. You will not learn what will work for all multi-threaded solutions but you will perhaps get an inkling of what not to do and what indications to look for when analyzing profiled code.
Upvotes: 1
Reputation: 2989
I found this link, which specifically explains the multicore cache-handling issues that were affecting my multithreaded program:
http://www.multicoreinfo.com/research/intel/mem-issues.pdf
The site multicoreinfo.com in general has lots of good information and references about multicore programming.
Upvotes: 0
Reputation: 10516
One specific cause of unexpectedly poor results in parallelized code is false sharing; you won't see it coming if you don't know what's going on down there (I didn't). Here are two articles that discuss the cause and remedy for .NET:
http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
http://www.codeproject.com/KB/threads/FalseSharing.aspx
Rgds GJ
Upvotes: 2
Reputation: 21022
Upvotes: 2
Reputation: 19443
This book taught me a lot about these sorts of issues and why raw CPU power is not necessarily the only thing to pay attention to. I used it in grad school years ago, but I think all of the principles still apply:
http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901
Essentially, a major issue in multi-processor configurations is synchronizing access to main memory; if you don't get this right, it can be a real performance bottleneck. It's pretty complex, with the caches having to be kept in sync.
Upvotes: 4