Reputation: 111
I have a long-running Racket program. Because the search depends on randomness, running many instances of the same program helps find the answer faster, so I launch 10 instances from the command line on a 24-core machine. The average throughput when running one instance (on one core) is 500 iterations/s. When running 10 instances (on 10 cores), it drops to 100 iterations/s per core. I expected similar per-core throughput, because the instances do not interact with each other at all. Has anyone else experienced this behavior? What is happening, and how can I fix it?
--------------------------- Additional information -----------------------------
OS: Ubuntu 13.10; cores: 24
Each instance writes its own output file. Approximately once per minute, each instance replaces its output file with the updated result, which is about 10 lines of text. So I don't think they are I/O bound.
According to top, each instance uses 1.5-2.5% of memory. With 10 instances running, 16 GB is used and 9 GB is free. With nothing running, 11 GB is used and 14 GB is free.
There is no network request.
The following are (current-memory-use) readings divided by 1,000,000 (MB), taken over 12 minutes on 3 of the 10 cores.
When I run (current-memory-use) without anything else, it returns 29 MB.
Upvotes: 0
Views: 112
Reputation: 111
I found the issue. My program was indeed using too much memory. As a result, when I ran multiple instances at the same time, either the working set no longer fit in cache (probably L3) or the instances exceeded the available memory bandwidth.
I then tried to find out why my program used so much memory. By calling (current-memory-use) at many places in the program, I found that the issue came from arithmetic-shift. That one operation immediately doubled the memory usage.
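The probing approach described above can be sketched roughly as follows. This is only an illustration: memory-delta is a hypothetical helper name, not something from the original program, and the numbers it reports are approximate because the garbage collector may run at any time during the measured call.

```racket
#lang racket/base

;; Hypothetical helper (not from the original program): run `thunk`
;; and report how much (current-memory-use) changed around the call.
;; Returns two values: the thunk's result and the byte delta.
(define (memory-delta thunk)
  (collect-garbage)                       ; start from a settled heap
  (define before (current-memory-use))
  (define result (thunk))
  (define after (current-memory-use))
  (values result (- after before)))

;; Example: measure a step that allocates a vector of 1000 slots.
(define-values (vec delta)
  (memory-delta (lambda () (make-vector 1000 0))))
(printf "allocated roughly ~a bytes\n" delta)
```

Wrapping suspicious steps one at a time like this is one way to narrow down which operation is responsible for the growth.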
The problem occurred when executing (arithmetic-shift x y) with a large x and a positive y. In that case, I believe the result no longer fits in a "fixnum" (unboxed) and is represented as a heap-allocated "bignum" (boxed) instead.
Even though I masked the result to 32 bits later, something prevented Racket from optimizing that away, likely first-order functions. I fixed it by masking x before passing it to arithmetic-shift so that the result never exceeds 32 bits. Now my program uses 80 MB instead of 300 MB, and I get the speedup I expect!
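A minimal sketch of the two orderings, for illustration only. The names MASK32, shl32/mask-after and shl32/mask-before are hypothetical (the original code is not shown), and the sketch assumes a 64-bit Racket and a shift amount y between 0 and 32.

```racket
#lang racket/base

;; Hypothetical names; assumes 64-bit Racket and 0 <= y <= 32.
(define MASK32 #xFFFFFFFF)

;; Shift first, mask afterwards: the intermediate (x << y) can grow
;; past the fixnum range and be allocated on the heap.
(define (shl32/mask-after x y)
  (bitwise-and (arithmetic-shift x y) MASK32))

;; Mask first: keep only the low (32 - y) bits of x, so the shifted
;; value is always below 2^32 and no large intermediate is created.
(define (shl32/mask-before x y)
  (arithmetic-shift (bitwise-and x (arithmetic-shift MASK32 (- y))) y))

;; A shift whose result is too large to be a fixnum:
(define big (arithmetic-shift (expt 2 61) 8))
(printf "fixnum? ~a\n" (fixnum? big))

;; Both orderings agree on the final 32-bit value.
(printf "~x ~x\n"
        (shl32/mask-after #x123456789ABCDEF 8)
        (shl32/mask-before #x123456789ABCDEF 8))
```

Both functions compute the same 32-bit result; the difference is only whether a large intermediate value is ever created.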
Upvotes: 1
Reputation: 16260
I suppose this isn't truly an answer; it's more like a guess and advice that doesn't fit in a comment.
From the list @MarkSetchell gave, the most obvious place to start is I/O -- do the processes make network requests, or share an input file?
Slightly less obvious (but, wild guess, more likely in your case) is memory. A sole instance could use all available RAM, if needed. Does it? With 10 instances sharing the same RAM, they'd probably garbage collect more often, which would be slower.
Try adding something like
    (thread
     (λ ()
       (let loop ()
         (displayln (current-memory-use))
         (sleep 5)
         (loop))))
and see how that plots over time. For one instance, does it top out at a value? How does that compare to RAM in the system?
And/or, use

    racket -W "error debug@GC" <your-program>

to show debug-level log info from the GC.
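If the log output is too noisy, a rough in-program check of how much time goes to garbage collection might look like the sketch below. gc-share is a hypothetical helper name; it relies on current-gc-milliseconds and current-inexact-milliseconds from racket/base, and the workload shown is only a stand-in for the real program.

```racket
#lang racket/base

;; Hypothetical helper: run `thunk` and return two values, the
;; milliseconds spent in GC and the total elapsed milliseconds.
(define (gc-share thunk)
  (define gc0 (current-gc-milliseconds))
  (define t0 (current-inexact-milliseconds))
  (thunk)
  (values (- (current-gc-milliseconds) gc0)
          (- (current-inexact-milliseconds) t0)))

;; Stand-in workload that allocates a lot of short-lived data.
(define-values (gc-ms total-ms)
  (gc-share
   (lambda ()
     (for ([i (in-range 200)])
       (make-vector 10000 i)))))
(printf "GC: ~a ms of ~a ms total\n" gc-ms total-ms)
```

Comparing the GC share for one instance against the share with 10 instances running would indicate whether more frequent collection explains the slowdown.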
Upvotes: 0