Reputation: 133
I have got this seemingly trivial parallel quicksort implementation, the code is as follows:
import System.Random
import Control.Parallel
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort xs = pQuicksort 16 xs -- 16 is the number of sparks used to sort

-- pQuicksort, parallelQuicksort
-- As long as n > 0 evaluates the lower and upper part of the list in parallel,
-- when we have recursed deep enough, n==0, this turns into a serial quicksort.
pQuicksort :: Ord a => Int -> [a] -> [a]
pQuicksort _ [] = []
pQuicksort 0 (x:xs) =
  let (lower, upper) = partition (< x) xs
  in  pQuicksort 0 lower ++ [x] ++ pQuicksort 0 upper
pQuicksort n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksort (n `div` 2) lower
      u = [x] ++ pQuicksort (n `div` 2) upper
  in  (par u l) ++ u

main :: IO ()
main = do
  gen <- getStdGen
  let randints = (take 5000000) $ randoms gen :: [Int]
  putStrLn . show . sum $ (quicksort randints)
I compile with
ghc --make -threaded -O2 quicksort.hs
and run with
./quicksort +RTS -N16 -RTS
No matter what I do, I cannot get this to run faster than a simple sequential implementation running on one CPU.
EDIT: @tempestadept hinted that quicksort itself is the problem. To check this I implemented a simple merge sort in the same spirit as the example above. It has the same behaviour: it performs slower the more capabilities you add.
import System.Random
import Control.Parallel

splitList :: [a] -> ([a], [a])
splitList = helper True [] []
  where helper _ left right [] = (left, right)
        helper True left right (x:xs) = helper False (x:left) right xs
        helper False left right (x:xs) = helper True left (x:right) xs

merge :: (Ord a) => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) = case x < y of
  True  -> x : merge xs (y:ys)
  False -> y : merge (x:xs) ys

mergeSort :: (Ord a) => [a] -> [a]
mergeSort xs = pMergeSort 16 xs -- we use 16 sparks

-- pMergeSort, parallel merge sort. Takes an extra argument
-- telling how many sparks to create. In our simple test it is
-- set to 16.
pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (r `par` l) `pseq` (merge l r)

ris :: Int -> IO [Int]
ris n = do
  gen <- getStdGen
  return . (take n) $ randoms gen

main = do
  r <- ris 100000
  putStrLn . show . sum $ mergeSort r
Upvotes: 13
Views: 2308
Reputation: 9767
There are a couple of problems that have already been mentioned in the other answers. The solution below avoids them by working on a mutable array from massiv, rather than lists, and by handing the recursive calls to a scheduler so they can run in parallel.
-- A helper function that partitions a region of a mutable array.
unstablePartitionRegionM ::
     forall r e m. (Mutable r Ix1 e, PrimMonad m)
  => MArray (PrimState m) r Ix1 e
  -> (e -> Bool)
  -> Ix1 -- ^ Start index of the region
  -> Ix1 -- ^ End index of the region
  -> m Ix1
unstablePartitionRegionM marr f start end = fromLeft start (end + 1)
  where
    fromLeft i j
      | i == j = pure i
      | otherwise = do
          x <- A.unsafeRead marr i
          if f x
            then fromLeft (i + 1) j
            else fromRight i (j - 1)
    fromRight i j
      | i == j = pure i
      | otherwise = do
          x <- A.unsafeRead marr j
          if f x
            then do
              A.unsafeWrite marr j =<< A.unsafeRead marr i
              A.unsafeWrite marr i x
              fromLeft (i + 1) j
            else fromRight i (j - 1)
{-# INLINE unstablePartitionRegionM #-}
Here is the actual in-place quicksort:
quicksortMArray ::
     (Ord e, Mutable r Ix1 e, PrimMonad m)
  => Int
  -> (m () -> m ())
  -> A.MArray (PrimState m) r Ix1 e
  -> m ()
quicksortMArray numWorkers schedule marr =
  schedule $ qsort numWorkers 0 (unSz (msize marr) - 1)
  where
    qsort n !lo !hi =
      when (lo < hi) $ do
        p <- A.unsafeRead marr hi
        l <- unstablePartitionRegionM marr (< p) lo hi
        A.unsafeWrite marr hi =<< A.unsafeRead marr l
        A.unsafeWrite marr l p
        if n > 0
          then do
            let !n' = n - 1
            schedule $ qsort n' lo (l - 1)
            schedule $ qsort n' (l + 1) hi
          else do
            qsort n lo (l - 1)
            qsort n (l + 1) hi
{-# INLINE quicksortMArray #-}
Now if we look at the arguments numWorkers and schedule, they are pretty opaque. If we supply 1 for the first argument and id for the second one, we simply have a sequential quicksort; but if we had a function available that could schedule each task to be computed concurrently, we would get a parallel implementation of quicksort. Luckily for us, massiv provides it out of the box with withMArray:
withMArray ::
     (Mutable r ix e, MonadUnliftIO m)
  => Array r ix e
  -> (Int -> (m () -> m ()) -> MArray RealWorld r ix e -> m a)
  -> m (Array r ix e)
Here is a pure version that makes a copy of the array and then sorts it in place, using the computation strategy specified within the array itself:
quicksortArray :: (Mutable r Ix1 e, Ord e) => Array r Ix1 e -> Array r Ix1 e
quicksortArray arr = unsafePerformIO $ withMArray arr quicksortMArray
{-# INLINE quicksortArray #-}
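For illustration (this is not part of the answer above), here is one way it might be called; fromList, setComp and Par are the usual massiv names, but treat this as a sketch since exact types can vary between massiv versions:

-- Hypothetical usage sketch (assumes everything is in scope from
-- "import Data.Massiv.Array"; names/types may differ by version):
-- build a primitive array from a list, request parallel computation,
-- and sort it with quicksortArray defined above.
sortedExample :: Array P Ix1 Int
sortedExample = quicksortArray (setComp Par (fromList Seq [3, 1, 4, 1, 5] :: Array P Ix1 Int))

Since withMArray runs according to the array's computation strategy, the setComp Par here is what makes the sort use the parallel scheduler.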
Here comes the best part: the benchmarks. The order of the results matches the labels below: vector-algorithms, Vector, C (which I grabbed from this question), and the massiv Array version, first sequential and then in parallel:
benchmarking QuickSort/Vector Algorithms
time 101.3 ms (93.75 ms .. 107.8 ms)
0.991 R² (0.974 R² .. 1.000 R²)
mean 97.13 ms (95.17 ms .. 100.2 ms)
std dev 4.127 ms (2.465 ms .. 5.663 ms)
benchmarking QuickSort/Vector
time 89.51 ms (87.69 ms .. 91.92 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 92.67 ms (91.54 ms .. 94.50 ms)
std dev 2.438 ms (1.468 ms .. 3.493 ms)
benchmarking QuickSort/C
time 88.14 ms (86.71 ms .. 89.41 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 90.11 ms (89.17 ms .. 93.35 ms)
std dev 2.744 ms (387.1 μs .. 4.686 ms)
benchmarking QuickSort/Array
time 76.07 ms (75.77 ms .. 76.41 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 76.08 ms (75.75 ms .. 76.28 ms)
std dev 453.7 μs (247.8 μs .. 699.6 μs)
benchmarking QuickSort/Array Par
time 25.25 ms (24.84 ms .. 25.61 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 25.13 ms (24.80 ms .. 25.75 ms)
std dev 991.6 μs (468.5 μs .. 1.782 ms)
Benchmarks are sorting 1,000,000 random Int64s. If you'd like to see the full code, you can find it here: https://github.com/lehins/haskell-quicksort
To sum it up, we got a 3x speedup on a quad-core processor with 8 capabilities, which sounds pretty good to me. Thanks for this question, now I can add a sorting function to massiv ;)
Edit
Note that the accepted answer, which uses lists instead of a more appropriate data structure for this problem such as a mutable array, is about 100x slower on the same input:
benchmarking List/random/List Par
time 2.712 s (2.566 s .. 3.050 s)
0.998 R² (0.996 R² .. 1.000 R²)
mean 2.696 s (2.638 s .. 2.745 s)
std dev 59.09 ms (40.83 ms .. 72.04 ms)
variance introduced by outliers: 19% (moderately inflated)
Upvotes: 5
Reputation: 50819
I'm not sure this is worth noting, given @lehins' excellent answer, but...
pQuickSort doesn't work

There are two big problems with your pQuickSort. The first is that you're using System.Random, which is bog slow and interacts strangely with a parallel sort (see below). The second is that your par u l sparks a computation to evaluate:

u = [x] ++ pQuicksort (n `div` 2) upper

to WHNF, namely u = x : UNEVALUATED_THUNK, so your sparks aren't doing any real work.
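Just to illustrate that point (this is not part of the original answer), one way to make the spark do real work is to force the sparked branch past WHNF, for example with force from Control.DeepSeq. A minimal sketch, assuming an NFData instance for the element type:

import Control.DeepSeq (NFData, force)
import Control.Parallel (par)
import Data.List (partition)

-- Sketch: force the upper branch to normal form so the spark
-- evaluates the whole sublist instead of just its first cons cell.
pQuicksortForced :: (NFData a, Ord a) => Int -> [a] -> [a]
pQuicksortForced _ [] = []
pQuicksortForced 0 (x:xs) =
  let (lower, upper) = partition (< x) xs
  in  pQuicksortForced 0 lower ++ [x] ++ pQuicksortForced 0 upper
pQuicksortForced n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksortForced (n `div` 2) lower
      u = force ([x] ++ pQuicksortForced (n `div` 2) upper)
  in  (u `par` l) ++ u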
In fact, it's not difficult to observe a performance improvement when parallelizing a naive, not-in-place, pseudo-quicksort. As mentioned, an important consideration is to avoid using System.Random. With a fast LCG, we can benchmark the actual sort time, rather than some weird mixture of sort and random number generation. The following pseudo-quicksort:
import Data.List

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let (a,b) = partition (<=x) xs
    in  qsort a ++ x:qsort b
qsort [] = []

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  print . sum $ qsort randints
when compiled with GHC 8.6.4 and -O2, runs in about 9.7 seconds. The following "parallelized" version:
qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let (a,b) = partition (<=x) xs
        a' = qsort a
        b' = qsort b
    in  (b' `par` a') ++ x:b'
qsort [] = []
compiled with ghc -O2 -threaded, runs in about 11.0 seconds on one capability. Add +RTS -N4, and it runs in 7.1 seconds.
Ta da! An improvement.
(In contrast, the version with System.Random runs in about 13 seconds for the non-parallel version, about 12 seconds for the parallel version on one capability (probably just because of some minor strictness improvement), and slows down substantially for each additional capability added; timings are erratic, too, though I'm not quite sure why.)
partition
One obvious problem with this version is that, even with a' = qsort a and b' = qsort b running in parallel, they're tied to the same sequential partition call. By dividing this up into two filters:
qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let a = qsort $ filter (<=x) xs
        b = qsort $ filter (>x) xs
    in  b `par` a ++ x:b
qsort [] = []
we speed things up to about 5.5 seconds with -N4. To be fair, even the non-parallel version is actually slightly faster with two filters in place of the partition call, at least when sorting Ints. There are probably some additional optimizations that are possible with the filters compared to the partition that make the extra comparisons worth it.
Now, what you tried to do in pQuickSort above was to limit the parallel computations to the top-most set of recursions. Let's use the following psort to experiment with this:
psort :: Ord a => Int -> [a] -> [a]
psort n (x:xs)
  = let a = psort (n-1) $ filter (<=x) xs
        b = psort (n-1) $ filter (>x) xs
    in  if n > 0 then b `par` a ++ x:b else a ++ x:b
psort _ [] = []
This will parallelize the top n layers of the recursion. My particular LCG example with a seed of 1 (i.e., iterate lcg 1) recurses up to 54 layers, so psort 55 should give the same performance as the fully parallel version, except for the overhead of keeping track of layers. When I run it, I get a time of about 5.8 seconds with -N4, so the overhead is quite small.
Now, look what happens as we reduce the number of layers:
| Layers | 55 | 40 | 30 | 20 | 10 | 5 | 3 | 1 |
|--------+-----+-----+-----+-----+-----+-----+-----+------|
| time | 5.5 | 5.6 | 5.7 | 5.4 | 7.0 | 8.9 | 9.8 | 10.2 |
Note that, at the lowest layers, there's little to be gained from parallel computation. This is mostly because the average depth of the tree is probably around 25 layers or so, so there's only a handful of computations at 50 layers, many with weird, lop-sided partitions, and they're certainly too small to parallelize. On the flip side, there doesn't seem to be any penalty for those extra par calls.
Meanwhile, there are increasing gains all the way down to at least 20 layers, so trying to artificially limit the total number of sparks to 16 (e.g., the top 4 or 5 layers) is a big loss.
Upvotes: 3
Reputation: 1766
par only evaluates the first argument to weak head normal form. That is to say: if the first argument's type is Maybe Int, then par will check whether the result is Nothing or Just something and stop. It won't evaluate something at all. Likewise for lists, it only evaluates enough to check whether the list is [] or something:something_else. To evaluate the whole list in parallel, you don't pass the list directly to par; you make an expression that depends on the list in such a way that when you pass it to par, the entire list is needed. For example:
evalList :: [a] -> ()
evalList [] = ()
evalList (a:r) = a `pseq` evalList r

pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (evalList r `par` l) `pseq` (merge l r)
Another note: the overhead for launching new threads in Haskell is really low, so the pMergeSort 0 ... case is probably not useful.
Upvotes: 3
Reputation: 835
You won't get any noticeable improvement since your pseudo-quicksort involves list concatenation, which cannot be parallelized and requires quadratic time in the worst case (total time for all concatenations). I'd recommend trying a mergesort instead, which is O(n log n) on linked lists.
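As a side illustration of that concatenation cost (my sketch, not part of this answer): a pseudo-quicksort can avoid the repeated (++) calls by threading an accumulator, so each element is consed exactly once:

import Data.List (partition)

-- Sketch: accumulator-passing quicksort; go lo (x : go hi acc)
-- builds the result without any use of (++).
qsortAcc :: Ord a => [a] -> [a]
qsortAcc xs0 = go xs0 []
  where
    go [] acc = acc
    go (x:xs) acc =
      let (lo, hi) = partition (< x) xs
      in  go lo (x : go hi acc)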
Also, to run the program with a large number of threads you should compile it with -rtsopts.
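For example, extending the compile and run lines from the question with that flag:
ghc --make -threaded -rtsopts -O2 quicksort.hs
./quicksort +RTS -N16 -RTS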
Upvotes: 3
Reputation: 29962
I'm not sure how well it can work for the idiomatic quicksort, but it can work (to a somewhat weak extent) for the true imperative quicksort as shown by Roman in Sparking Imperatives.
He never did get good speedup, though. To optimize properly, this really needs a real work-stealing deque that doesn't overflow the way the spark queue does.
Upvotes: 4