Performance characteristics of length vs fold vs explicit recursion

Question

I've written six versions of the length function. Some of the performance differences make sense, but some of them don't seem to agree with the articles I've read at all (e.g. this one and this one).

-- len1 and lenFold1 should have equivalent performance, right?

len1 :: [a] -> Integer
len1 [] = 0
len1 (x:xs) = len1 xs + 1

lenFold1 :: [a] -> Integer
lenFold1 = foldr (\_ a -> a + 1) 0


-- len2 and lenFold2 should have equivalent performance, right?

len2 :: [a] -> Integer
len2 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs (1 + acc)

lenFold2 :: [a] -> Integer
lenFold2 = foldl (\a _ -> a + 1) 0


-- len3 and lenFold3 should have equivalent performance, right?
-- And len3 should outperform len1 and len2, right?

len3 :: [a] -> Integer
len3 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs $! (1 + acc)

lenFold3 :: [a] -> Integer
lenFold3 = foldl' (\a _ -> a + 1) 0

The actual performance on my machine is puzzling.

*Main Lib> :set +m +s
*Main Lib> xs = [1..10000000]
(0.01 secs, 351,256 bytes)
*Main Lib> len1 xs
10000000
(5.47 secs, 2,345,244,016 bytes)
*Main Lib> lenFold1 xs
10000000
(2.74 secs, 1,696,750,840 bytes)
*Main Lib> len2 xs
10000000
(6.02 secs, 2,980,997,432 bytes)
*Main Lib> lenFold2 xs
10000000
(3.97 secs, 1,776,750,816 bytes)
*Main Lib> len3 xs
10000000
(5.24 secs, 3,520,354,616 bytes)
*Main Lib> lenFold3 xs
10000000
(1.24 secs, 1,040,354,528 bytes)
*Main Lib> length xs
10000000
(0.21 secs, 720,354,480 bytes)

My questions:

Why does the fold version of each function consistently outperform the version using explicit recursion?
None of these implementations ever reaches a stack overflow on my machine, despite the warnings of this article. Why not?
Why doesn't len3 perform better than len1 or len2?
Why does the Prelude's length perform so much better than any of these implementations?

EDIT:

Thanks to Carl's suggestion, my first and second questions are addressed by the fact that GHCI interprets code by default. Running it again with -fobject-code accounts for the different performance between the explicit recursion and the fold. The new measurements:

Prelude Lib Main> xs = [1..10000000]
(0.00 secs, 354,136 bytes)
Prelude Lib Main> len1 xs
10000000
(1.62 secs, 1,612,661,544 bytes)
Prelude Lib Main> lenFold1 xs
10000000
(1.62 secs, 1,692,661,552 bytes)
Prelude Lib Main> len2 xs
10000000
(2.46 secs, 1,855,662,888 bytes)
Prelude Lib Main> lenFold2 xs
10000000
(2.53 secs, 1,772,661,528 bytes)
Prelude Lib Main> len3 xs
10000000
(0.48 secs, 1,680,361,272 bytes)
Prelude Lib Main> lenFold3 xs
10000000
(0.31 secs, 1,040,361,240 bytes)
Prelude Lib Main> length xs
10000000
(0.18 secs, 720,361,272 bytes)

I still have a few questions about this.

Why does lenFold3 outperform len3? I ran this a few times
How does length still outperform all of these implementations?

K. A. Buhr · Accepted Answer

I don't think you can properly test performance from GHCi, no matter what flags you try to use.

In general, the best way to do performance testing of Haskell code is to use the Criterion benchmarking library and compile with ghc -O2. Converted to a Criterion benchmark, your program looks like this:

import Criterion.Main
import GHC.List
import Prelude hiding (foldr, foldl, foldl', length)

len1 :: [a] -> Integer
len1 [] = 0
len1 (x:xs) = len1 xs + 1

lenFold1 :: [a] -> Integer
lenFold1 = foldr (\_ a -> a + 1) 0

len2 :: [a] -> Integer
len2 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs (1 + acc)

lenFold2 :: [a] -> Integer
lenFold2 = foldl (\a _ -> a + 1) 0

len3 :: [a] -> Integer
len3 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs $! (1 + acc)

lenFold3 :: [a] -> Integer
lenFold3 = foldl' (\a _ -> a + 1) 0

testLength :: ([Int] -> Integer) -> Integer
testLength f = f [1..10000000]

main = defaultMain
  [ bench "lenFold1" $ whnf testLength lenFold1
  , bench "len1" $ whnf testLength len1
  , bench "len2" $ whnf testLength len2
  , bench "lenFold2" $ whnf testLength lenFold2
  , bench "len3" $ whnf testLength len3
  , bench "lenFold3" $ whnf testLength lenFold3
  , bench "length" $ whnf testLength (fromIntegral . length)
  ]

and the abbreviated results on my machine are:

len1                 190.9 ms   (136.8 ms .. 238.6 ms)
lenFold1             207.8 ms   (151.6 ms .. 248.6 ms)
len2                 69.96 ms   (69.09 ms .. 71.63 ms)
lenFold2             1.191 s    (917.1 ms .. 1.454 s)
len3                 69.26 ms   (69.20 ms .. 69.35 ms)
lenFold3             87.14 ms   (86.95 ms .. 87.35 ms)
length               26.78 ms   (26.50 ms .. 27.08 ms)

Note that these results are quite different from the performance you observed running these tests from GHCi, both in absolute and relative terms, and with or without -fobject-code. Why? Beats me.

Anyway, based on this proper benchmark, len1 and lenFold1 have nearly identical performance. In fact, the Core generated for lenFold1 is:

lenFold1 = len1

so they are identical functions. The apparent difference in my benchmarks is real, though, and it appears to be the result of some cache/alignment issue. If I reorder len1 and lenFold1 in main, the performance difference flips around (so that len1 is the "slow one").

len2 and len3 also have identical performance because they are identical functions. (In fact, the generated code for len3 is len3 = len2.) GHC's strictness analyser determines that the expression 1 + acc can be evaluated strictly, even without the explicit $! operator.

lenFold3 is slightly slower because foldl' isn't inlined, so the combining function needs an explicit call every time through. This is arguably a bug that's been reported here. We can work around it by changing the definition of lenFold3 to explicitly provide three arguments to foldl':

lenFold3 xs = foldl' (\a _ -> a + 1) 0 xs

and then it performs just as well as len2 and len3:

lenFold3             66.99 ms   (66.76 ms .. 67.30 ms)

The abysmal performance of lenFold2 is a manifestation of the same problem. Without inlining, GHC can't perform proper optimization. If we change the definition to:

lenFold2 xs = foldl (\a _ -> a + 1) 0 xs

it performs just as well as the others:

lenFold2             66.64 ms   (66.58 ms .. 66.68 ms)

To be clear, after making these two changes to lenFold2 and lenFold3, the functions len2, len3, lenFold2, and lenFold3 are all identical, except that lenFold2 and lenFold3 apply the + operator in a different order. If we use the definitions:

lenFold2 xs = foldl (\a _ -> 1 + a) 0 xs
lenFold3 xs = foldl' (\a _ -> 1 + a) 0 xs

then the generated Core (which you can view with ghc -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques -fforce-recomp) is actually:

len2 = ...actual definition...
lenFold2 = len2
len3 = len2
lenFold3 = len2

so they're all precisely identical.

They are genuinely different from len1 (or equivalently lenFold1) because len1 builds up a large set of stack frames that it then needs to process when it gets to the end of the list and "discovers" that an empty list has length zero. The reason there's no stack overflow is that a lot of the blog posts about Haskell stack overflows appears to be either obsolete or based on GHCi tests. In code compiled with modern GHC versions, the maximum stack size defaults to 80% of physical memory, so you can use gigabytes of stack without really noticing. In this case, some quick profiling with +RTS -hT shows that the stack grows to about 60-70 megabytes for a single len1 [1..10000000], not nearly enough to overflow anything. In contrast, the len2 family doesn't accumulate any appreciable stack.

Finally, the reason length blows them all away is that it calculates the length using an Int instead of an Integer. If I change type signatures to:

len1 :: [a] -> Int
len2 :: [a] -> Int

then I get:

len1                 144.7 ms   (121.8 ms .. 157.9 ms)
len2                 27.38 ms   (27.31 ms .. 27.44 ms)
length               27.50 ms   (27.45 ms .. 27.54 ms)

and len2 (and so lenFold2, len3, and lenFold3) are all as fast as length.

Performance characteristics of length vs fold vs explicit recursion

Answers (1)

Related Questions