Sorting in parallel performance

Question

I tried to run some programs with multicore and kinda confused by the results. By default sorting in program below takes 20 seconds, when I run it with +RTS -N2 it takes around 16 secs, but with +RTS -N4 it takes 21 second!

Why it is like that? And is there example of program that gets faster with each extra core? (had similar results with other programs in tutorials)

Here's example of program:

import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let a = qsort $ filter (<=x) xs
        b = qsort $ filter (>x)  xs
    in b `par` a ++ x:b
qsort [] = []


randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  t1 <- getPOSIXTime
  print . sum $ qsort randints
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "
"

K. A. Buhr · Accepted Answer

I can't duplicate your results. (Which is a good thing, since I think I was the one claiming to see a performance improvement with -N2 and -N4 with the code you posted.)

On Linux with GHC 8.8.3, and compiling to a standalone executable with -O2 -threaded, I get the following timings on a 4-core desktop:

$ stack ghc -- --version
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
The Glorious Glasgow Haskell Compilation System, version 8.8.3
$ stack ghc -- -O2 -threaded QuickSort3.hs
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
[1 of 1] Compiling Main             ( QuickSort3.hs, QuickSort3.o )
Linking QuickSort3 ...
$ ./QuickSort3 +RTS -N1
10741167410134688
SORT TIME: 7.671760902s

$ ./QuickSort3 +RTS -N2
10741167410134688
SORT TIME: 5.700858877s

$ ./QuickSort3 +RTS -N3
10741167410134688
SORT TIME: 4.88330669s

$ ./QuickSort3 +RTS -N4
10741167410134688
SORT TIME: 4.93364958s

I get similar results with a 16-core Linux laptop and also similar results with a 4-core Windows virtual machine (also using GHC 8.8.3) running on that laptop.

I can think of a few possible explanations for your results.

First, I don't have a tremendously fast desktop machine, so your timings of 20secs seem suspicious. Is it possible you're doing something like:

$ stack runghc QuickSort3.hs +RTS -N4

If so, this passes the +RTS flags to stack, and then runs the Haskell program in single-threaded mode using the slow byte-code interpreter. In my tests, the sort then takes about 30secs no matter what -Nx flag value I pass.

Second, is it possible you're running this on a virtual machine with a limited number of cores (or an extremely old piece of two-core hardware)? As noted, I tried testing under a Windows virtual machine and got similar results to the Linux version with a 4-core virtual machine but quite erratic results with a 2-core virtual machine (e.g., 11.4, 13.0, and 51.3secs for -N1, -N2, and -N4 respectively, so worse performance for more cores in general, and off-the-charts bad performance for 4 cores).

You could try the following simple parallel sums benchmark, which might scale better:

import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX

randomList :: Int -> Int -> [Int]
randomList seed n = take n $ tail (iterate lcg seed)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  t1 <- getPOSIXTime
  let n = 50000000
      a = sum $ randomList 1 n
      b = sum $ randomList 2 n
      c = sum $ randomList 3 n
      d = sum $ randomList 4 n
      e = sum $ randomList 5 n
      f = sum $ randomList 6 n
      g = sum $ randomList 7 n
      h = sum $ randomList 8 n
  print $ a `par` b `par` c `par` d `par` e `par` f `par` g `par` h `par` (a+b+c+d+e+f+g+h)
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "
"

Sorting in parallel performance

Answers (1)

Related Questions