Reputation: 62808
Haskell provides a `par` combinator, which queues a "spark" for possible evaluation in parallel with the current thread. It also provides a `pseq` combinator, which forces evaluation of pure code to occur in a specific order.
What Haskell does not appear to provide is a way to generate several sparks and then wait for them all to finish. This is quite trivial to achieve with explicit concurrency, but appears to be impossible with pure sparks.
In part, this is perhaps because of the intended use-case for sparks. They appear to be designed for speculative evaluation. That is, doing work which might be needed, but might not be. Hence, sparks only run on cores which are otherwise idle.
However, this is not my use-case. I have a bunch of results which I know, for a fact, will be needed soon. And if I start trying to process the results before the sparks have fired, I'll just end up single-threaded again with a bunch of fizzled sparks.
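To make the scenario concrete, here is a sketch of the fizzle pattern (the function names and workload are mine, purely for illustration): the sparks are queued, but the consumer immediately demands each value, so the main thread usually does the work itself and the sparks fizzle.

```haskell
import Control.Parallel (par)

-- Sketch of the fizzle pattern: each element is sparked, but `sum`
-- demands the elements right away, so the main thread tends to
-- evaluate them before any idle core converts the sparks.
fizzly :: [Int] -> Int
fizzly xs = foldr par (sum ys) ys
  where
    ys = map expensive xs
    expensive n = sum [1 .. n]   -- stand-in for real work
```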
Of course, if `par` waited for sparks to complete, it wouldn't achieve any parallelism! But it would be really useful if there were some way to produce several sparks and then wait for them all to finish. I can't find any way to do that, though.
Does anybody have any useful suggestions? (Other than "use explicit concurrency", obviously.)
Upvotes: 12
Views: 567
Reputation: 4761
The Really Short Answer
You can't.
The Short Answer
To wait for a spark to finish, try to evaluate what the spark was evaluating. For example, if you have expressions `a` and `b`, then to calculate `a + b` you can do

a `par` b `par` (a + b)

or

a `par` b `pseq` (a + b)
The Long Answer
When you create a spark by using `par`, you are telling the run time system "I will need this value later, so you should evaluate it in parallel." When you later need the value, either the spark has evaluated the expression or it hasn't. If it has, the thunk will have been replaced by a value, so re-evaluation has no cost: it is just fetching the value. If it hasn't been evaluated by the spark, then waiting for the spark is useless; it might take a while to get scheduled, and the waiting thread is wasting time. Instead of waiting, you should just evaluate the expression yourself. Essentially, there is no need to wait for a spark: you just try to evaluate the original expression and get the performance benefits.
Also, a note on speculation: although sparks can be and often are used for speculation, that is not entirely what they are designed for. I see `par` being used for simple parallelization, as in `pfib` below, much more often than I see it used for speculation.
Examples
A standard example is parallelizing the Fibonacci numbers, from the serial
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)
to the parallel version:
pfib 0 = 0
pfib 1 = 1
pfib n = l `par` r `pseq` (l + r)
  where
    l = pfib (n - 1)
    r = pfib (n - 2)
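One refinement worth noting (this variant and its cutoff value are my own sketch, not part of the original answer): naive `pfib` creates a spark for every recursive call, most of which are too small to be worth scheduling. A granularity threshold keeps the spark pool from being flooded with tiny sparks that fizzle.

```haskell
import Control.Parallel (par, pseq)

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Only spark subproblems above a cutoff; below it, fall back to the
-- sequential version so tiny sparks don't flood the pool.
pfibT :: Int -> Int
pfibT n
  | n <= 20   = fib n
  | otherwise = l `par` (r `pseq` (l + r))
  where
    l = pfibT (n - 1)
    r = pfibT (n - 2)
```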
Now for an example using speculation:
spec :: Eq a
     => a        -- a guess at the input value
     -> (a -> b) -- a function to transform the input value
     -> a        -- the actual input value - this will require computation
     -> b        -- the function applied to the input value
spec guess f input = let speculation = f guess in speculation `par`
    if guess == input
        then speculation
        else f input
The Hackage package I got this from, speculation, actually has a couple of optimizations, such as not sparking on a single core and checking whether the input has already been evaluated, but those don't matter to the working of the function.
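A usage sketch (the example values and names are mine): when the guess is right, the sparked result is used directly; when it is wrong, the speculative work is simply discarded and `f` is applied to the real input.

```haskell
import Control.Parallel (par)

spec :: Eq a => a -> (a -> b) -> a -> b
spec guess f input = let speculation = f guess in speculation `par`
    if guess == input then speculation else f input

-- Right guess: sum [1..4] is 10, so the sparked (* 2) result is used.
goodGuess :: Int
goodGuess = spec 10 (* 2) (sum [1 .. 4])

-- Wrong guess: sum [1..5] is 15, so f is re-applied to the real input.
badGuess :: Int
badGuess = spec 10 (* 2) (sum [1 .. 5])
```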
Other Solutions that Make Things More Explicit
If plain `par` feels too implicit, there are libraries that make the parallel structure explicit, such as the `Eval` monad in `Control.Parallel.Strategies` (from the `parallel` package) and the `Par` monad from the `monad-par` package.
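For instance, the `Eval` monad's `rpar`/`rseq` combinators (from `Control.Parallel.Strategies` in the `parallel` package) let you spell out "spark these, then wait for both" directly. A sketch (the function and its workload are mine):

```haskell
import Control.Parallel.Strategies (runEval, rpar, rseq)

-- Spark both operands with rpar, then rseq each one so both are
-- evaluated before the results are combined.
bothOf :: Int -> Int -> Int
bothOf x y = runEval $ do
    a <- rpar (slowFib x)
    b <- rpar (slowFib y)
    _ <- rseq a
    _ <- rseq b
    return (a + b)
  where
    slowFib :: Int -> Int
    slowFib n
      | n < 2     = n
      | otherwise = slowFib (n - 2) + slowFib (n - 1)
```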
Upvotes: 6
Reputation: 183873
You can try putting the results of the sparked computations into a strict data structure:
{-# LANGUAGE BangPatterns #-}
module Main where
import Control.Parallel
fib :: Int -> Int
fib n
  | n < 1     = 0
  | n == 1    = 1
  | otherwise = fib (n-1) + fib (n-2)
trib :: Int -> Int
trib n
  | n < 1     = 0
  | n < 3     = 1
  | otherwise = trib (n-1) + trib (n-2) + trib (n-3)
data R = R { res1, res2 :: !Int }
main :: IO ()
main = do
    let !r = let a = fib 38
                 b = trib 31
             in a `par` b `pseq` (R a b)
    print $ res1 r
    print $ fib 28
    print $ res2 r
That worked here:
$ ./spark +RTS -N2 -s
39088169
317811
53798080
65,328 bytes allocated in the heap
9,688 bytes copied during GC
5,488 bytes maximum residency (1 sample(s))
30,680 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 1.33 (686 / 515, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.59s ( 0.59s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 0.59s) 0.00s ( 0.00s)
Task 2 (bound) : 0.59s ( 0.59s) 0.00s ( 0.00s)
Task 3 (worker) : 0.00s ( 0.59s) 0.00s ( 0.00s)
SPARKS: 1 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.17s ( 0.59s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.18s ( 0.59s elapsed)
Alloc rate 55,464 bytes per MUT second
Productivity 99.9% of total user, 199.1% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
Upvotes: 6