Eben Kadile
Eben Kadile

Reputation: 779

Haskell - What's up with parMap?

I've run some tests:

import Control.Parallel.Strategies
import Data.Vector as V
import Data.Maybe

parMapVec :: (a -> b) -> Vector a -> Vector b
parMapVec f v = runEval $ evalTraversable rpar $ V.map f v

range :: Integer -> Integer -> Vector Integer
range x y
  | x == y = x `cons` empty
  | x < y  = x `cons` (range (x + 1) y)
  | x > y  = (range x (y + 1)) `snoc` y

fac :: Integer -> Integer
fac n
  | n < 2     = 1
  | otherwise = n * (fac $ n - 1)

main :: IO ()
main = do
  let result = runEval $ do
        let calc = parMapVec fac $ 80000 `range` 80007
        rseq calc
        return calc
  putStrLn $ show result

As well as with the following modification to main to make sure that my parMapVector wasn't what was wrong:

main = do
  let result = runEval $ do
        let calc = parMap rpar fac [80000..80007]
        rseq calc
        return calc
  putStrLn $ show result

I compiled with gch --make parVectorTest.hs -threaded -rtsopts and ran both with ./parVectorTest -s.

Here's what I found with the version with vectors:

56,529,547,832 bytes allocated in the heap
10,647,896,984 bytes copied during GC
    7,281,792 bytes maximum residency (16608 sample(s))
    3,285,392 bytes maximum slop
            21 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
Gen  0     82708 colls,     0 par    0.828s   0.802s     0.0000s    0.0016s
Gen  1     16608 colls,     0 par   15.006s  14.991s     0.0009s    0.0084s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 8 (7 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

INIT    time    0.001s  (  0.001s elapsed)
MUT     time    5.368s  (  5.369s elapsed)
GC      time   15.834s  ( 15.793s elapsed)
EXIT    time    0.001s  (  0.000s elapsed)
Total   time   21.206s  ( 21.163s elapsed)

Alloc rate    10,530,987,847 bytes per MUT second

Productivity  25.3% of total user, 25.4% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

So that's good, except that I watched the process execute on my system monitor, and only one core was working at a time. Every time one of the results was printed out, the process would switch to a different core. So I thought it was something wrong with my parMapVec function. But then I did the same thing except with the version with lists:

56,529,535,488 bytes allocated in the heap
12,483,967,024 bytes copied during GC
    6,246,872 bytes maximum residency (19843 sample(s))
    2,919,544 bytes maximum slop
            20 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
Gen  0     79459 colls,     0 par    0.818s   0.786s     0.0000s    0.0009s
Gen  1     19843 colls,     0 par   17.725s  17.709s     0.0009s    0.0087s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 16 (14 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)

INIT    time    0.001s  (  0.001s elapsed)
MUT     time    5.394s  (  5.400s elapsed)
GC      time   18.543s  ( 18.495s elapsed)
EXIT    time    0.000s  (  0.000s elapsed)
Total   time   23.940s  ( 23.896s elapsed)

Alloc rate    10,479,915,927 bytes per MUT second

Productivity  22.5% of total user, 22.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

So there was more garbage collection, which makes sense. And there was also more sparks, which I don't know how to explain. This program exhibited the same behavior when I watched it execute on my system monitor.

I also ran both tests with ./parVector -s -C0.01 because of the answer to this question and got basically the same results. I'm on a Lenovo Ideapad, 8 cores, running Ubuntu Linux 17.04. At the time of the tests, the only apps I had open were VS Code and my system monitor, although there other processes taking up a very small portion of the processing power. Does a processor have to be completely idle to take a spark?

Upvotes: 2

Views: 404

Answers (1)

user2407038
user2407038

Reputation: 14578

By default, GHC runs all programs using a single OS thread, even with -threaded enabled. Note the text "using -N1" in your output - it indicates that the program is being run with 1 physical thread.

In short: pass e.g. +RTS -N8 to your program. For documentation of this flag, see here.


In a broad sense, this is due to the distinction between parallelism and concurrency. Here are some SO questions which try to explain the difference. The difference can be summarized as:

  • parrallelism: a task subdivided into similar chunks to run simultaneously on separate cores/CPUs at some point in time; for increased speed

  • concurrency: several tasks being executed conceptually independently such that their execution times overlap, whether on the same thread through time slicing or on separate cores/CPUs; usually utilizing shared resources more efficiently

However, these definitions are somewhat contentious; sometimes the two have opposite meanings, and sometimes they are used interchangeably. However, for the purpose of understanding this problem (why you must pass another flag in addition to -threaded to make a 'parallel' program actually run in parallel) I believe they are useful definitions.

Upvotes: 6

Related Questions