Cartesius00
Cartesius00

Reputation: 24384

Summing a large list of numbers is too slow

Task: "Sum the first 15,000,000 even numbers."

Haskell:

nats = [1..] :: [Int]
evens = filter even nats :: [Int]

MySum:: Int
MySum= sum $ take 15000000 evens

...but MySum takes ages. More precisely, about 10-20 times slower than C/C++.

Many times I've found, that a Haskell solution coded naturally is something like 10 times slower than C. I expected that GHC was a very neatly optimizing compiler and task such this don't seem that tough.

So, one would expect something like 1.5-2x slower than C. Where is the problem?

Can this be solved better?

This is the C code I'm comparing it with:

long long sum = 0;
int n = 0, i = 1;

for (;;) {

  if (i % 2 == 0) {
    sum += i;
    n++;
  }

  if (n == 15000000)
    break;

  i++;
}

Edit 1: I really know, that it can be computed in O(1). Please, resist.

Edit 2: I really know, that evens are [2,4..] but the function even could be something else O(1) and need to be implemented as a function.

Upvotes: 9

Views: 2032

Answers (7)

Will Ness
Will Ness

Reputation: 71075

Sum first 15,000,000 even numbers:

{-# LANGUAGE BangPatterns #-}

g :: Integer    -- 15000000*15000001 = 225000015000000
g = go 1 0 0
  where
    go i !a c  | c == 15000000 = a       
    go i !a c  | even i = go (i+1) (a+i) (c+1)
    go i !a c           = go (i+1) a c

ought to be the fastest.

Upvotes: 5

yatima2975
yatima2975

Reputation: 6610

Another thing to note is that nats and evens are so-called Constant Applicative Forms, or CAFs for short. Basically, those correspond to top-level definitions without any arguments. CAFs are a bit of an odd duck, for instance being the reason for the Dreaded Monomorphism Restriction; I'm not sure the language definition even allows CAFs to be inlined.

In my mental model of how Haskell executes, by the time dumbSum returns a value, evens will be evaluated to look something like 2:4: ... : 30000000 : <thunk> and nats to 1:2: ... : 30000000 : <thunk>, where the <thunk>s represent something that's not been looked at yet. If my understanding is correct, these allocations of : do have to happen and can't be optimized away.

So one way of speeding things up without altering your code too much would be to simply write:

dumbSum :: Int
dumbSum = sum . take 15000000 . filter even $ [1..]

or

dumbSum = sum $ take 15000000 evens where
    nats = [1..]
    evens = filter even nats

On my machine, compiled with -O2, that alone seems to result in a roughly 30% speedup.

I'm no GHC connaisseur (I've never even profiled a Haskell program!), so I could be wildly off the mark, though.

Upvotes: 1

Peter Wortmann
Peter Wortmann

Reputation: 2312

The problem why list fusion can't work here is actually rather subtle. Say we define the right RULE to fuse the list away:

import GHC.Base
sum2 :: Num a => [a] -> a
sum2 = sum
{-# NOINLINE [1] sum2 #-}
{-# RULES "sum" forall (f :: forall b. (a->b->b)->b->b).
                sum2 (build f) = f (+) 0 #-}

(The short explanation is that we define sum2 as an alias of sum, which we forbid GHC to inline early, so the RULE has a chance to fire before sum2 gets eliminated. Then we look for sum2 directly next to the list-builder build (see definition) and replace it by direct arithmetic.)

This has mixed success, as it yields the following Core:

Main.$wgo =
  \ (w_s1T4 :: GHC.Prim.Int#) ->
    case GHC.Prim.remInt# w_s1T4 2 of _ {
      __DEFAULT ->
        case w_s1T4 of wild_Xg {
          __DEFAULT -> Main.$wgo (GHC.Prim.+# wild_Xg 1);
          15000000 -> 0
        };
      0 ->
        case w_s1T4 of wild_Xg {
          __DEFAULT ->
            case Main.$wgo (GHC.Prim.+# wild_Xg 1) of ww_s1T7 { __DEFAULT ->
            GHC.Prim.+# wild_Xg ww_s1T7
            };
          15000000 -> 15000000
        }
    }

Which is nice, completely fused code - with the sole problem that we have a call to $wgo in a non-tail-call position. This means that we aren't looking at a loop, but actually at a deeply recursive function, with predictable program results:

Stack space overflow: current size 8388608 bytes.

The root problem here is that the Prelude's list fusion can only fuse right folds, and computing the sum as a right fold directly causes the excessive stack consumption. The obvious fix would be to use a fusion framework that can actually deal with left folds, such as Duncan's stream-fusion package, which actually implements sum fusion.

Another solution would be to hack around it - and implement the left fold using a right fold:

main = print $ foldr (\x c -> c . (+x)) id [2,4..15000000] 0

This actually produces close-to-perfect code with current versions of GHC. On the other hand, this is generally not a good idea as it relies on GHC being smart enough to eliminate the partially applied functions. Already adding a filter into the chain will break that particular optimization.

Upvotes: 11

Petr
Petr

Reputation: 63379

First, you can be clever as young Gauss was and compute the sum in O(1).

Fun stuff aside, your Haskell solution uses lists. I'm quite sure your C/C++ solution doesn't. (Haskell lists are very easy to use so one is tempted to use them even in cases where it might not be appropriate.) Try benchmarking this:

sumBy2 :: Integer -> Integer
sumBy2 = f 0
  where
    f result n | n <= 1     = result
               | otherwise  = f (n + result) (n - 2)

Compile it using GHC with -O2 argument. This function is tail-recursive so compiler can implement it very efficiently.

Update: If you want it using even function, it's possible:

sumBy2 :: Integer -> Integer
sumBy2 = f 0
  where
    f result n | n <= 0     = result
               | even n     = f (n + result) (n - 1)
               | otherwise  = f result (n - 1)

You can also easily make the filtering function a parameter:

sumFilter :: (Integral a) => (a -> Bool) -> a -> a
sumFilter filtfn = f 0
  where
    f result n | n <= 0     = result
               | filtfn n   = f (n + result) (n - 1)
               | otherwise  = f result (n - 1)

Upvotes: 3

amindfv
amindfv

Reputation: 8438

If you want to be sure to traverse the list only once, you can write the traversal explicitly:

nats = [1..] :: [Int]

requiredOfX :: Int -> Bool -- this way you can write a different requirement
requiredOfX x = even x

dumbSum :: Int
dumbSum = dumbSum' 0 0 nats
  where dumbSum' acc 15000000 _ = acc
        dumbSum' acc count (x:xs)
          | requiredOfX x = dumbSum' (acc + x) (count + 1) xs
          | otherwise     = dumbSum' acc (count + 1) xs

Upvotes: 4

Daniel Fischer
Daniel Fischer

Reputation: 183908

Lists are not loops

So don't be surprised if using lists as a loop replacement, you get slower code if the loop body is small.

nats = [1..] :: [Int]
evens = filter even nats :: [Int]

dumbSum :: Int
dumbSum = sum $ take 15000000 evens

sum is not a "good consumer", so GHC is not (yet) able to eliminate the intermediate lists completely.

If you compile with optimisations (and don't export nat), GHC is smart enough to fuse the filter with the enumeration,

Rec {
Main.main_go [Occ=LoopBreaker]
  :: GHC.Prim.Int# -> GHC.Prim.Int# -> [GHC.Types.Int]
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
Main.main_go =
  \ (x_aV2 :: GHC.Prim.Int#) ->
    let {
      r_au7 :: GHC.Prim.Int# -> [GHC.Types.Int]
      [LclId, Str=DmdType]
      r_au7 =
        case x_aV2 of wild_Xl {
          __DEFAULT -> Main.main_go (GHC.Prim.+# wild_Xl 1);
          9223372036854775807 -> n_r1RR
        } } in
    case GHC.Prim.remInt# x_aV2 2 of _ {
      __DEFAULT -> r_au7;
      0 ->
        let {
          wild_atm :: GHC.Types.Int
          [LclId, Str=DmdType m]
          wild_atm = GHC.Types.I# x_aV2 } in
        let {
          lvl_s1Rp :: [GHC.Types.Int]
          [LclId]
          lvl_s1Rp =
            GHC.Types.:
              @ GHC.Types.Int wild_atm (GHC.Types.[] @ GHC.Types.Int) } in
        \ (m_aUL :: GHC.Prim.Int#) ->
          case GHC.Prim.<=# m_aUL 1 of _ {
            GHC.Types.False ->
              GHC.Types.: @ GHC.Types.Int wild_atm (r_au7 (GHC.Prim.-# m_aUL 1));
            GHC.Types.True -> lvl_s1Rp
          }
    }
end Rec }

but that's as far as GHC's fusion takes it. You are left with boxing Ints and constructing list cells. If you give it a loop, like you give it to the C compiler,

module Main where

import Data.Bits

main :: IO ()
main = print dumbSum

dumbSum :: Int
dumbSum = go 0 0 1
  where
    go :: Int -> Int -> Int -> Int
    go sm ct n
        | ct >= 15000000 = sm
        | n .&. 1 == 0   = go (sm + n) (ct+1) (n+1)
        | otherwise      = go sm ct (n+1)

you get the approximate relation of running times between the C and the Haskell version you expected.

This sort of algorithm is not what GHC has been taught to optimise well, there are bigger fish to fry elsewhere before the limited manpower is put into these optimisations.

Upvotes: 24

Cfr
Cfr

Reputation: 5133

Strict version works much faster:

foldl' (+) 0 $ take 15000000 [2, 4..]

Upvotes: 2

Related Questions