Reputation: 24384
Task: "Sum the first 15,000,000 even numbers."
Haskell:
nats = [1..] :: [Int]
evens = filter even nats :: [Int]
MySum:: Int
MySum= sum $ take 15000000 evens
...but MySum
takes ages. More precisely, about 10-20 times slower than C/C++.
Many times I've found, that a Haskell solution coded naturally is something like 10 times slower than C. I expected that GHC was a very neatly optimizing compiler and task such this don't seem that tough.
So, one would expect something like 1.5-2x slower than C. Where is the problem?
Can this be solved better?
This is the C code I'm comparing it with:
long long sum = 0;
int n = 0, i = 1;
for (;;) {
if (i % 2 == 0) {
sum += i;
n++;
}
if (n == 15000000)
break;
i++;
}
Edit 1: I really know, that it can be computed in O(1). Please, resist.
Edit 2: I really know, that evens are [2,4..]
but the function even
could be something else O(1)
and need to be implemented as a function.
Upvotes: 9
Views: 2032
Reputation: 71075
Sum first 15,000,000 even numbers:
{-# LANGUAGE BangPatterns #-}
g :: Integer -- 15000000*15000001 = 225000015000000
g = go 1 0 0
where
go i !a c | c == 15000000 = a
go i !a c | even i = go (i+1) (a+i) (c+1)
go i !a c = go (i+1) a c
ought to be the fastest.
Upvotes: 5
Reputation: 6610
Another thing to note is that nats
and evens
are so-called Constant Applicative Forms, or CAFs for short. Basically, those correspond to top-level definitions without any arguments. CAFs are a bit of an odd duck, for instance being the reason for the Dreaded Monomorphism Restriction; I'm not sure the language definition even allows CAFs to be inlined.
In my mental model of how Haskell executes, by the time dumbSum
returns a value, evens
will be evaluated to look something like 2:4: ... : 30000000 : <thunk>
and nats
to 1:2: ... : 30000000 : <thunk>
, where the <thunk>
s represent something that's not been looked at yet. If my understanding is correct, these allocations of :
do have to happen and can't be optimized away.
So one way of speeding things up without altering your code too much would be to simply write:
dumbSum :: Int
dumbSum = sum . take 15000000 . filter even $ [1..]
or
dumbSum = sum $ take 15000000 evens where
nats = [1..]
evens = filter even nats
On my machine, compiled with -O2
, that alone seems to result in a roughly 30% speedup.
I'm no GHC connaisseur (I've never even profiled a Haskell program!), so I could be wildly off the mark, though.
Upvotes: 1
Reputation: 2312
The problem why list fusion can't work here is actually rather subtle. Say we define the right RULE
to fuse the list away:
import GHC.Base
sum2 :: Num a => [a] -> a
sum2 = sum
{-# NOINLINE [1] sum2 #-}
{-# RULES "sum" forall (f :: forall b. (a->b->b)->b->b).
sum2 (build f) = f (+) 0 #-}
(The short explanation is that we define sum2
as an alias of sum
, which we forbid GHC to inline early, so the RULE
has a chance to fire before sum2
gets eliminated. Then we look for sum2
directly next to the list-builder build
(see definition) and replace it by direct arithmetic.)
This has mixed success, as it yields the following Core:
Main.$wgo =
\ (w_s1T4 :: GHC.Prim.Int#) ->
case GHC.Prim.remInt# w_s1T4 2 of _ {
__DEFAULT ->
case w_s1T4 of wild_Xg {
__DEFAULT -> Main.$wgo (GHC.Prim.+# wild_Xg 1);
15000000 -> 0
};
0 ->
case w_s1T4 of wild_Xg {
__DEFAULT ->
case Main.$wgo (GHC.Prim.+# wild_Xg 1) of ww_s1T7 { __DEFAULT ->
GHC.Prim.+# wild_Xg ww_s1T7
};
15000000 -> 15000000
}
}
Which is nice, completely fused code - with the sole problem that we have a call to $wgo
in a non-tail-call position. This means that we aren't looking at a loop, but actually at a deeply recursive function, with predictable program results:
Stack space overflow: current size 8388608 bytes.
The root problem here is that the Prelude's list fusion can only fuse right folds, and computing the sum as a right fold directly causes the excessive stack consumption.
The obvious fix would be to use a fusion framework that can actually deal with left folds, such as Duncan's stream-fusion package, which actually implements sum
fusion.
Another solution would be to hack around it - and implement the left fold using a right fold:
main = print $ foldr (\x c -> c . (+x)) id [2,4..15000000] 0
This actually produces close-to-perfect code with current versions of GHC. On the other hand, this is generally not a good idea as it relies on GHC being smart enough to eliminate the partially applied functions. Already adding a filter
into the chain will break that particular optimization.
Upvotes: 11
Reputation: 63379
First, you can be clever as young Gauss was and compute the sum in O(1).
Fun stuff aside, your Haskell solution uses lists. I'm quite sure your C/C++ solution doesn't. (Haskell lists are very easy to use so one is tempted to use them even in cases where it might not be appropriate.) Try benchmarking this:
sumBy2 :: Integer -> Integer
sumBy2 = f 0
where
f result n | n <= 1 = result
| otherwise = f (n + result) (n - 2)
Compile it using GHC with -O2
argument. This function is tail-recursive so compiler can implement it very efficiently.
Update: If you want it using even
function, it's possible:
sumBy2 :: Integer -> Integer
sumBy2 = f 0
where
f result n | n <= 0 = result
| even n = f (n + result) (n - 1)
| otherwise = f result (n - 1)
You can also easily make the filtering function a parameter:
sumFilter :: (Integral a) => (a -> Bool) -> a -> a
sumFilter filtfn = f 0
where
f result n | n <= 0 = result
| filtfn n = f (n + result) (n - 1)
| otherwise = f result (n - 1)
Upvotes: 3
Reputation: 8438
If you want to be sure to traverse the list only once, you can write the traversal explicitly:
nats = [1..] :: [Int]
requiredOfX :: Int -> Bool -- this way you can write a different requirement
requiredOfX x = even x
dumbSum :: Int
dumbSum = dumbSum' 0 0 nats
where dumbSum' acc 15000000 _ = acc
dumbSum' acc count (x:xs)
| requiredOfX x = dumbSum' (acc + x) (count + 1) xs
| otherwise = dumbSum' acc (count + 1) xs
Upvotes: 4
Reputation: 183908
So don't be surprised if using lists as a loop replacement, you get slower code if the loop body is small.
nats = [1..] :: [Int]
evens = filter even nats :: [Int]
dumbSum :: Int
dumbSum = sum $ take 15000000 evens
sum
is not a "good consumer", so GHC is not (yet) able to eliminate the intermediate lists completely.
If you compile with optimisations (and don't export nat
), GHC is smart enough to fuse the filter
with the enumeration,
Rec {
Main.main_go [Occ=LoopBreaker]
:: GHC.Prim.Int# -> GHC.Prim.Int# -> [GHC.Types.Int]
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
Main.main_go =
\ (x_aV2 :: GHC.Prim.Int#) ->
let {
r_au7 :: GHC.Prim.Int# -> [GHC.Types.Int]
[LclId, Str=DmdType]
r_au7 =
case x_aV2 of wild_Xl {
__DEFAULT -> Main.main_go (GHC.Prim.+# wild_Xl 1);
9223372036854775807 -> n_r1RR
} } in
case GHC.Prim.remInt# x_aV2 2 of _ {
__DEFAULT -> r_au7;
0 ->
let {
wild_atm :: GHC.Types.Int
[LclId, Str=DmdType m]
wild_atm = GHC.Types.I# x_aV2 } in
let {
lvl_s1Rp :: [GHC.Types.Int]
[LclId]
lvl_s1Rp =
GHC.Types.:
@ GHC.Types.Int wild_atm (GHC.Types.[] @ GHC.Types.Int) } in
\ (m_aUL :: GHC.Prim.Int#) ->
case GHC.Prim.<=# m_aUL 1 of _ {
GHC.Types.False ->
GHC.Types.: @ GHC.Types.Int wild_atm (r_au7 (GHC.Prim.-# m_aUL 1));
GHC.Types.True -> lvl_s1Rp
}
}
end Rec }
but that's as far as GHC's fusion takes it. You are left with boxing Int
s and constructing list cells. If you give it a loop, like you give it to the C compiler,
module Main where
import Data.Bits
main :: IO ()
main = print dumbSum
dumbSum :: Int
dumbSum = go 0 0 1
where
go :: Int -> Int -> Int -> Int
go sm ct n
| ct >= 15000000 = sm
| n .&. 1 == 0 = go (sm + n) (ct+1) (n+1)
| otherwise = go sm ct (n+1)
you get the approximate relation of running times between the C and the Haskell version you expected.
This sort of algorithm is not what GHC has been taught to optimise well, there are bigger fish to fry elsewhere before the limited manpower is put into these optimisations.
Upvotes: 24
Reputation: 5133
Strict version works much faster:
foldl' (+) 0 $ take 15000000 [2, 4..]
Upvotes: 2