MasterMastic

Reputation: 21306

Why does a strict length function perform noticeably faster?

I toyed around with definitions to better understand the evaluation model, and wrote two for the length of a list.

The naive definition:

len :: [a] -> Int
len [] = 0
len (_:xs) = 1 + len xs

The strict (and tail-recursive) definition:

slen :: [a] -> Int -> Int
slen [] n = n
slen (_:xs) !n = slen xs (n+1)

len [1..10000000] takes about 5-6 seconds to run.
slen [1..10000000] 0 takes about 3-4 seconds to run.
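For anyone who wants to reproduce this, here is a minimal sketch of a complete test module (the module layout and main are just for illustration); note that the BangPatterns extension is needed for the !n pattern:

{-# LANGUAGE BangPatterns #-}
module Main where

len :: [a] -> Int
len [] = 0
len (_:xs) = 1 + len xs

slen :: [a] -> Int -> Int
slen [] n = n
slen (_:xs) !n = slen xs (n+1)

main :: IO ()
main = do
  print (len  [1..10000000 :: Int])
  print (slen [1..10000000 :: Int] 0)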

I'm curious why. Before I checked their performance I was sure they would perform about the same, because len should have at most one more thunk to evaluate. For demonstration purposes:

len [a,b,c,d]
= 1 + len [b,c,d]
= 1 + 1 + len [c,d]
= 1 + 1 + 1 + len [d]
= 1 + 1 + 1 + 1 + len []
= 1 + 1 + 1 + 1 + 0
= 4

And

slen [a,b,c,d] 0
= slen [b,c,d] 1
= slen [c,d]   2
= slen [d]     3
= slen []      4
= 4

What makes slen noticeably faster?

P.S. I also wrote a tail-recursive lazy function (just like slen but lazy) in an attempt to close in on the reason -- maybe it's because it was tail-recursive -- but it performed about the same as the naive definition.
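For reference, that lazy variant looked essentially like this (the name lslen is just for illustration):

lslen :: [a] -> Int -> Int
lslen []     n = n
lslen (_:xs) n = lslen xs (n+1)  -- without the bang, n+1 stays an unevaluated thunk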

Upvotes: 8

Views: 173

Answers (2)

Christian Conkle

Reputation: 5992

David Young's answer gives the correct explanation of the difference in evaluation order. You should think about Haskell evaluation in the way he outlines.

Let me show you how you can see the difference in the Core. I think it's actually more visible with optimizations on, because the evaluation ends up as an explicit case statement. If you've never played with Core before, see the canonical SO question on the topic: Reading GHC Core.

Generate the core output with ghc -O2 -ddump-simpl -dsuppress-all -ddump-to-file SO27392665.hs. You'll see that GHC splits both len and slen into a recursive "worker" function, $wlen or $wslen, and a nonrecursive "wrapper" function. Because the vast majority of the time is spent in the recursive "workers," focus on them:

Rec {
$wlen
$wlen =
  \ @ a_arZ w_sOR ->
    case w_sOR of _ {
      [] -> 0;
      : ds_dNU xs_as0 ->
        case $wlen xs_as0 of ww_sOU { __DEFAULT -> +# 1 ww_sOU }
    }
end Rec }

len
len =
  \ @ a_arZ w_sOR ->
    case $wlen w_sOR of ww_sOU { __DEFAULT -> I# ww_sOU }

Rec {
$wslen
$wslen =
  \ @ a_arR w_sOW ww_sP0 ->
    case w_sOW of _ {
      [] -> ww_sP0;
      : ds_dNS xs_asW -> $wslen xs_asW (+# ww_sP0 1)
    }
end Rec }

slen
slen =
  \ @ a_arR w_sOW w1_sOX ->
    case w1_sOX of _ { I# ww1_sP0 ->
    case $wslen w_sOW ww1_sP0 of ww2_sP4 { __DEFAULT -> I# ww2_sP4 }
    }

You can see that $wslen has only one case, while $wlen has two. If you go look at David's answer, you can trace what happens in $wlen: it does its case analysis on the outermost list constructor ([]/:), then makes the recursive call to $wlen xs_as0 (i.e. len xs), and cases on that result as well, i.e. forces it before adding 1.

In $wslen, on the other hand, there's only the one case statement. In the recursive branch, there's simply an unboxed addition, (+# ww_sP0 1), which doesn't create a thunk.

(Note: a previous version of this answer had stated that with -O GHC could specialize $wslen but not $wlen to use unboxed Int#s. That's not the case.)
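As an aside, the bang pattern isn't the only way to get this worker: a seq-based variant should compile to essentially the same thing (a sketch; I haven't compared its Core side by side):

slenSeq :: [a] -> Int -> Int
slenSeq []     n = n
slenSeq (_:xs) n = n `seq` slenSeq xs (n + 1)  -- force the accumulator before recursing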

Upvotes: 4

David Young

Reputation: 10793

The final step of len is not O(1). It is O(n) to add together n numbers. len also uses O(n) memory while slen uses O(1) memory.

The reason it uses O(n) memory is that each thunk uses up some memory. So when you have something like this:

1 + 1 + 1 + 1 + len []

there are five unevaluated thunks (including len [])

In GHCi, we can examine this thunk behavior a little more easily with the :sprint command. The :sprint command prints the given value without forcing the evaluation of any thunks (you can learn more from :help). I'll use conses ((:)) since we can more easily evaluate each thunk one at a time, but the principle is the same.

λ> let ys = map id $ 1 : 2 : 3 : [] :: [Int] -- map id prevents GHCi from being too eager here
λ> :sprint ys
ys = _
λ> take 1 ys
[1]
λ> :sprint ys
ys = 1 : _
λ> take 2 ys
[1,2]
λ> :sprint ys
ys = 1 : 2 : _
λ> take 3 ys
[1,2,3]
λ> :sprint ys
ys = 1 : 2 : 3 : _
λ> take 4 ys
[1,2,3]
λ> :sprint ys
ys = [1,2,3]

Unevaluated thunks are represented by _ and you can see that in the original ys there are 4 thunks nested inside of each other, one for each part of the list (including []).

There isn't a good way that I know of to see this in Int because its evaluation is more all or nothing, but it still builds up a nested thunk in the same way. If you could see it, the evaluation would look something like this:

len [a,b,c,d]
= 1 + len [b,c,d]
= 1 + 1 + len [c,d]
= 1 + 1 + 1 + len [d]
= 1 + 1 + 1 + 1 + len []
= 1 + 1 + 1 + 1 + 0
= 1 + 1 + 1 + 1       -- Here it stops building the thunks and starts evaluating them
= 1 + 1 + 2
= 1 + 3
= 4
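For what it's worth, this strict-accumulator pattern is exactly what a strict left fold packages up; a minimal sketch using Data.List.foldl' (the name lenFold is just for illustration):

import Data.List (foldl')

-- foldl' forces the accumulator at each step, so no chain of (+1) thunks
-- builds up: O(1) space, like slen.
lenFold :: [a] -> Int
lenFold = foldl' (\n _ -> n + 1) 0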

Upvotes: 9
