Reputation: 65
I've written six versions of the length
function. Some of the performance differences make sense, but some of them don't seem to agree with the articles I've read at all (e.g. this one and this one).
-- len1 and lenFold1 should have equivalent performance, right?
len1 :: [a] -> Integer
len1 [] = 0
len1 (x:xs) = len1 xs + 1
lenFold1 :: [a] -> Integer
lenFold1 = foldr (\_ a -> a + 1) 0
-- len2 and lenFold2 should have equivalent performance, right?
len2 :: [a] -> Integer
len2 xs = go xs 0 where
go [] acc = acc
go (x:xs) acc = go xs (1 + acc)
lenFold2 :: [a] -> Integer
lenFold2 = foldl (\a _ -> a + 1) 0
-- len3 and lenFold3 should have equivalent performance, right?
-- And len3 should outperform len1 and len2, right?
len3 :: [a] -> Integer
len3 xs = go xs 0 where
go [] acc = acc
go (x:xs) acc = go xs $! (1 + acc)
lenFold3 :: [a] -> Integer
lenFold3 = foldl' (\a _ -> a + 1) 0
The actual performance on my machine is puzzling.
*Main Lib> :set +m +s
*Main Lib> xs = [1..10000000]
(0.01 secs, 351,256 bytes)
*Main Lib> len1 xs
10000000
(5.47 secs, 2,345,244,016 bytes)
*Main Lib> lenFold1 xs
10000000
(2.74 secs, 1,696,750,840 bytes)
*Main Lib> len2 xs
10000000
(6.02 secs, 2,980,997,432 bytes)
*Main Lib> lenFold2 xs
10000000
(3.97 secs, 1,776,750,816 bytes)
*Main Lib> len3 xs
10000000
(5.24 secs, 3,520,354,616 bytes)
*Main Lib> lenFold3 xs
10000000
(1.24 secs, 1,040,354,528 bytes)
*Main Lib> length xs
10000000
(0.21 secs, 720,354,480 bytes)
My questions:
fold
version of each function consistently outperform the version using explicit recursion?len3
perform better than len1
or len2
?length
perform so much better than any of these implementations?EDIT:
Thanks to Carl's suggestion, my first and second questions are addressed by the fact that GHCI interprets code by default. Running it again with -fobject-code
accounts for the different performance between the explicit recursion and the fold. The new measurements:
Prelude Lib Main> xs = [1..10000000]
(0.00 secs, 354,136 bytes)
Prelude Lib Main> len1 xs
10000000
(1.62 secs, 1,612,661,544 bytes)
Prelude Lib Main> lenFold1 xs
10000000
(1.62 secs, 1,692,661,552 bytes)
Prelude Lib Main> len2 xs
10000000
(2.46 secs, 1,855,662,888 bytes)
Prelude Lib Main> lenFold2 xs
10000000
(2.53 secs, 1,772,661,528 bytes)
Prelude Lib Main> len3 xs
10000000
(0.48 secs, 1,680,361,272 bytes)
Prelude Lib Main> lenFold3 xs
10000000
(0.31 secs, 1,040,361,240 bytes)
Prelude Lib Main> length xs
10000000
(0.18 secs, 720,361,272 bytes)
I still have a few questions about this.
lenFold3
outperform len3
? I ran this a few timeslength
still outperform all of these implementations?Upvotes: 3
Views: 173
Reputation: 50864
I don't think you can properly test performance from GHCi, no matter what flags you try to use.
In general, the best way to do performance testing of Haskell code is to use the Criterion benchmarking library and compile with ghc -O2
. Converted to a Criterion benchmark, your program looks like this:
import Criterion.Main
import GHC.List
import Prelude hiding (foldr, foldl, foldl', length)
len1 :: [a] -> Integer
len1 [] = 0
len1 (x:xs) = len1 xs + 1
lenFold1 :: [a] -> Integer
lenFold1 = foldr (\_ a -> a + 1) 0
len2 :: [a] -> Integer
len2 xs = go xs 0 where
go [] acc = acc
go (x:xs) acc = go xs (1 + acc)
lenFold2 :: [a] -> Integer
lenFold2 = foldl (\a _ -> a + 1) 0
len3 :: [a] -> Integer
len3 xs = go xs 0 where
go [] acc = acc
go (x:xs) acc = go xs $! (1 + acc)
lenFold3 :: [a] -> Integer
lenFold3 = foldl' (\a _ -> a + 1) 0
testLength :: ([Int] -> Integer) -> Integer
testLength f = f [1..10000000]
main = defaultMain
[ bench "lenFold1" $ whnf testLength lenFold1
, bench "len1" $ whnf testLength len1
, bench "len2" $ whnf testLength len2
, bench "lenFold2" $ whnf testLength lenFold2
, bench "len3" $ whnf testLength len3
, bench "lenFold3" $ whnf testLength lenFold3
, bench "length" $ whnf testLength (fromIntegral . length)
]
and the abbreviated results on my machine are:
len1 190.9 ms (136.8 ms .. 238.6 ms)
lenFold1 207.8 ms (151.6 ms .. 248.6 ms)
len2 69.96 ms (69.09 ms .. 71.63 ms)
lenFold2 1.191 s (917.1 ms .. 1.454 s)
len3 69.26 ms (69.20 ms .. 69.35 ms)
lenFold3 87.14 ms (86.95 ms .. 87.35 ms)
length 26.78 ms (26.50 ms .. 27.08 ms)
Note that these results are quite different from the performance you observed running these tests from GHCi, both in absolute and relative terms, and with or without -fobject-code
. Why? Beats me.
Anyway, based on this proper benchmark, len1
and lenFold1
have nearly identical performance. In fact, the Core generated for lenFold1
is:
lenFold1 = len1
so they are identical functions. The apparent difference in my benchmarks is real, though, and it appears to be the result of some cache/alignment issue. If I reorder len1
and lenFold1
in main
, the performance difference flips around (so that len1
is the "slow one").
len2
and len3
also have identical performance because they are identical functions. (In fact, the generated code for len3
is len3 = len2
.) GHC's strictness analyser determines that the expression 1 + acc
can be evaluated strictly, even without the explicit $!
operator.
lenFold3
is slightly slower because foldl'
isn't inlined, so the combining function needs an explicit call every time through. This is arguably a bug that's been reported here. We can work around it by changing the definition of lenFold3
to explicitly provide three arguments to foldl'
:
lenFold3 xs = foldl' (\a _ -> a + 1) 0 xs
and then it performs just as well as len2
and len3
:
lenFold3 66.99 ms (66.76 ms .. 67.30 ms)
The abysmal performance of lenFold2
is a manifestation of the same problem. Without inlining, GHC can't perform proper optimization. If we change the definition to:
lenFold2 xs = foldl (\a _ -> a + 1) 0 xs
it performs just as well as the others:
lenFold2 66.64 ms (66.58 ms .. 66.68 ms)
To be clear, after making these two changes to lenFold2
and lenFold3
, the functions len2
, len3
, lenFold2
, and lenFold3
are all identical, except that lenFold2
and lenFold3
apply the +
operator in a different order. If we use the definitions:
lenFold2 xs = foldl (\a _ -> 1 + a) 0 xs
lenFold3 xs = foldl' (\a _ -> 1 + a) 0 xs
then the generated Core (which you can view with ghc -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques -fforce-recomp
) is actually:
len2 = ...actual definition...
lenFold2 = len2
len3 = len2
lenFold3 = len2
so they're all precisely identical.
They are genuinely different from len1
(or equivalently lenFold1
) because len1
builds up a large set of stack frames that it then needs to process when it gets to the end of the list and "discovers" that an empty list has length zero. The reason there's no stack overflow is that a lot of the blog posts about Haskell stack overflows appears to be either obsolete or based on GHCi tests. In code compiled with modern GHC versions, the maximum stack size defaults to 80% of physical memory, so you can use gigabytes of stack without really noticing. In this case, some quick profiling with +RTS -hT
shows that the stack grows to about 60-70 megabytes for a single len1 [1..10000000]
, not nearly enough to overflow anything. In contrast, the len2
family doesn't accumulate any appreciable stack.
Finally, the reason length
blows them all away is that it calculates the length using an Int
instead of an Integer
. If I change type signatures to:
len1 :: [a] -> Int
len2 :: [a] -> Int
then I get:
len1 144.7 ms (121.8 ms .. 157.9 ms)
len2 27.38 ms (27.31 ms .. 27.44 ms)
length 27.50 ms (27.45 ms .. 27.54 ms)
and len2
(and so lenFold2
, len3
, and lenFold3
) are all as fast as length
.
Upvotes: 11