Reputation: 779
I've run some tests:
import Control.Parallel.Strategies
import Data.Vector as V
import Data.Maybe

-- Spark the evaluation of every element of the vector.
parMapVec :: (a -> b) -> Vector a -> Vector b
parMapVec f v = runEval $ evalTraversable rpar $ V.map f v

-- Inclusive range as a Vector, in either direction.
range :: Integer -> Integer -> Vector Integer
range x y
    | x == y = x `cons` empty
    | x < y  = x `cons` range (x + 1) y
    | x > y  = range x (y + 1) `snoc` y

fac :: Integer -> Integer
fac n
    | n < 2     = 1
    | otherwise = n * fac (n - 1)

main :: IO ()
main = do
    let result = runEval $ do
            let calc = parMapVec fac $ 80000 `range` 80007
            rseq calc
            return calc
    putStrLn $ show result
As well as with the following modification to main, to make sure that my parMapVec wasn't what was wrong:
main = do
    let result = runEval $ do
            let calc = parMap rpar fac [80000..80007]
            rseq calc
            return calc
    putStrLn $ show result
I compiled with ghc --make parVectorTest.hs -threaded -rtsopts and ran both with ./parVectorTest +RTS -s.
Here's what I found with the version with vectors:
56,529,547,832 bytes allocated in the heap
10,647,896,984 bytes copied during GC
7,281,792 bytes maximum residency (16608 sample(s))
3,285,392 bytes maximum slop
21 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 82708 colls, 0 par 0.828s 0.802s 0.0000s 0.0016s
Gen 1 16608 colls, 0 par 15.006s 14.991s 0.0009s 0.0084s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 8 (7 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 5.368s ( 5.369s elapsed)
GC time 15.834s ( 15.793s elapsed)
EXIT time 0.001s ( 0.000s elapsed)
Total time 21.206s ( 21.163s elapsed)
Alloc rate 10,530,987,847 bytes per MUT second
Productivity 25.3% of total user, 25.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
So that's good, except that when I watched the process execute in my system monitor, only one core was doing work at a time. Every time one of the results was printed out, the process would jump to a different core. So I thought something was wrong with my parMapVec function. But then I did the same thing with the list version:
56,529,535,488 bytes allocated in the heap
12,483,967,024 bytes copied during GC
6,246,872 bytes maximum residency (19843 sample(s))
2,919,544 bytes maximum slop
20 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 79459 colls, 0 par 0.818s 0.786s 0.0000s 0.0009s
Gen 1 19843 colls, 0 par 17.725s 17.709s 0.0009s 0.0087s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 16 (14 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 5.394s ( 5.400s elapsed)
GC time 18.543s ( 18.495s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 23.940s ( 23.896s elapsed)
Alloc rate 10,479,915,927 bytes per MUT second
Productivity 22.5% of total user, 22.6% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
So there was more garbage collection, which makes sense. And there were also more sparks, which I don't know how to explain. This program exhibited the same behavior when I watched it execute in my system monitor.
I also ran both tests with ./parVectorTest +RTS -s -C0.01 because of the answer to this question, and got basically the same results. I'm on a Lenovo Ideapad, 8 cores, running Ubuntu Linux 17.04. At the time of the tests, the only apps I had open were VS Code and my system monitor, although there were other processes taking up a very small portion of the processing power. Does a processor have to be completely idle to take a spark?
Upvotes: 2
Views: 404
Reputation: 14578
By default, GHC runs all programs with a single capability (roughly, one core's worth of execution), even when compiled with -threaded. Note the text "using -N1" in your output: it indicates that the program is being run with one capability, so your sparks are never picked up by a second core.
In short: pass e.g. +RTS -N8 to your program. This flag is documented in the GHC User's Guide, in the section on RTS options for SMP parallelism.
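For example, assuming the file name from your question:

ghc --make parVectorTest.hs -threaded -rtsopts
./parVectorTest +RTS -N8 -s

Passing -N with no number (+RTS -N) lets the runtime pick a value based on how many processors the machine reports. You can also bake the setting into the binary at compile time with ghc --make parVectorTest.hs -threaded -rtsopts "-with-rtsopts=-N8", after which a plain ./parVectorTest will run with 8 capabilities.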
In a broad sense, this is due to the distinction between parallelism and concurrency. Here are some SO questions which try to explain the difference. The difference can be summarized as:
- parallelism: a task subdivided into similar chunks to run simultaneously on separate cores/CPUs at some point in time, for increased speed
- concurrency: several tasks being executed conceptually independently such that their execution times overlap, whether on the same thread through time slicing or on separate cores/CPUs; usually utilizing shared resources more efficiently
However, these definitions are somewhat contentious; sometimes the two have opposite meanings, and sometimes they are used interchangeably. Still, for the purpose of understanding this problem (why you must pass another flag in addition to -threaded to make a 'parallel' program actually run in parallel), I believe they are useful definitions.
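If you want to verify from inside the program how many capabilities it actually received, base exposes this at runtime. Here is a minimal sketch (my own, not from your code) using Control.Concurrent.getNumCapabilities:

import Control.Concurrent (getNumCapabilities)

main :: IO ()
main = do
    -- Reports the -N value the RTS is running with; under the default
    -- settings this prints 1, which is why only one core does work.
    n <- getNumCapabilities
    putStrLn $ "Running on " ++ show n ++ " capabilities"

Run it with +RTS -N8 and it should print 8; without the flag it prints 1, matching the "using -N1" in the stats above.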
Upvotes: 6