Vincent

Reputation: 13415

Performance issue with parallel computation in Haskell

I'm comparing the performance of two Haskell programs running the same computation.

The first one is sequential:

main :: IO()
main = putStr $ unlines . map (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)

The second one uses Control.Parallel.Strategies:

import Control.Parallel.Strategies

main :: IO()
main = putStr $ unlines . parMap rdeepseq (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)

In both cases, pow is the modular exponentiation naively implemented as:

pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * (pow a (b-1) m) `mod` m

The sequential program runs in about 3 seconds using, as expected, 100% CPU.

$ stack ghc seq.hs -- -O2
$ \time -f "%e s - %P" ./seq > /dev/null
2.96 s - 100%

The parallel program also runs in about 3 seconds using 100% CPU when limited to a single core.

$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.14 s - 99%

But when I ran it on 4 cores, I did not observe the performance gain I was expecting:

$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
3.31 s - 235%

Even more surprising, the sequential program uses more than 100% CPU when run on several cores:

$ stack ghc seq.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./seq +RTS -N4 > /dev/null
3.26 s - 232%

How can those results be explained?


EDIT - As advised by @RobertK and @Yuras, I replaced rdeepseq with rpar and it did fix the initial issue. However, the performance is still much lower than I expected:
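For reference, the parMap call now looks like this (a sketch assuming a plain rpar strategy, with the rest of the program unchanged):

import Control.Parallel.Strategies

main :: IO ()
main = putStr $ unlines . parMap rpar (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)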

$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.12 s - 99%
$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
1.91 s - 368%

The execution time is barely halved, even though the 4 cores are busy more than 90% of the time on average.

Also, some parts of the ThreadScope graph look very sequential:

[ThreadScope screenshot]

Upvotes: 4

Views: 397

Answers (2)

Yuras

Reputation: 13876

First of all, rdeepseq seems to be buggy. Try running ./par +RTS -N4 -s and you'll see that no sparks are created; that is why you don't see any speedup on 4 cores. Use rnf x `pseq` return x instead.
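A minimal sketch of how that strategy could be plugged into your parMap call (written inline here; rnf comes from Control.DeepSeq and pseq from Control.Parallel, and the rest of your program is assumed unchanged):

import Control.DeepSeq (rnf)
import Control.Parallel (pseq)
import Control.Parallel.Strategies

main :: IO ()
main = putStr $ unlines . parMap (\x -> rnf x `pseq` return x) (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)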

Also note the GC statistics in the +RTS -s output: GC actually takes most of the CPU time. With -N4 you have 4 parallel GC threads running, and they take more time. That is why the sequential program uses much more CPU on 4 cores: basically you have 3 GC threads sitting idle in a spin lock, waiting for synchronization. They do nothing useful but eat CPU in a busy loop. Try limiting the number of parallel GC threads with the -qn1 option.
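For example (assuming a GHC version recent enough to support the -qn RTS flag):

$ ./par +RTS -N4 -qn1 -s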

Regarding the performance gain: you should not expect perfect scaling. Also, I think you have one fizzled spark -- it is evaluated in parallel, but its result is not used.

Added: Comparing with the Python implementation you linked in the comments, I see that you are using a completely different algorithm in Haskell. A more or less similar approach is the following (it requires BangPatterns):

{-# LANGUAGE BangPatterns #-}

pow :: Int -> Int -> Int -> Int
pow a b m = go 1 b
  where
  -- strict accumulator: the intermediate result is forced on every step,
  -- so no chain of thunks (and no deep stack) builds up
  go !r 0 = r
  go r b' = go ((r * a) `mod` m) (pred b')

Your original algorithm uses the stack to build the result, so it is bound by GC, not by the actual computation; that is why you don't see a big speedup. With the new one I see a 3x speedup (I had to increase the amount of work to see the speedup, because otherwise the new algorithm finishes too quickly).

Upvotes: 2

Robert

Reputation: 307

I do not believe your parallel example is actually parallel. parMap accepts a strategy, and your strategy simply tells it to perform a deepseq. You need to combine this strategy with one that defines the parallel behaviour, e.g. rpar. You are telling Haskell 'perform this map, using this strategy', and right now your strategy does not define any parallel behaviour.
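One way to express that combination is rparWith from Control.Parallel.Strategies, which behaves like rpar but additionally applies the given strategy to the sparked value. A rough sketch, assuming the rest of the question's program stays the same:

import Control.Parallel.Strategies

main :: IO ()
main = putStr $ unlines . parMap (rparWith rdeepseq) (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)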

Also make sure that you compile your program with the -rtsopts flag (I do not know if stack does this for you, but GHC requires it to enable runtime options).
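For example, a compile line along those lines (only a sketch of the flags; adjust to your setup):

$ stack ghc par.hs -- -O2 -threaded -rtsopts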

Upvotes: 1
