nomicflux

Reputation: 45

Haskell Optimizations for List Processing stymied by Lazy Evaluation

I'm trying to improve the efficiency of the following code. I want to count all occurrences of a symbol before a given point (as part of pattern matching using a Burrows-Wheeler transform). There's some overlap in how I'm counting symbols, but when I try to implement what looks like it should be more efficient code, it turns out to be less efficient, and I'm assuming that lazy evaluation and my poor understanding of it are to blame.

My first attempt at a counting function went like this:

count :: Ord a => [a] -> a -> Int -> Int
count list sym pos = length . filter (== sym) . take pos $ list

Then in the body of the matching function itself:

matching str refCol pattern = match 0 (n - 1) (reverse pattern)
  where n = length str
        refFstOcc sym = length $ takeWhile (/= sym) refCol
        match top bottom [] = bottom - top + 1
        match top bottom (sym : syms) =
          let topCt = count str sym top
              bottomCt = count str sym (bottom + 1)
              middleCt = bottomCt - topCt
              refCt = refFstOcc sym
          in if middleCt > 0
               then match (refCt + topCt) (refCt + bottomCt - 1) syms
               else 0

(Stripped down for brevity - I'm memoizing first occurrences of symbols in refCol through a Map, along with a couple of other details.)

Edit: Sample use would be:

matching "AT$TCTAGT" "$AACGTTTT" "TCG"

which should be 1 (assuming I didn't mistype anything).

Now, I'm recounting everything in the middle between the top and bottom pointers twice, which adds up when I'm counting over a million-character DNA string with only 4 possible characters (and profiling tells me that this is the big bottleneck, taking 48% of my time for bottomCt and around 38% for topCt). For reference, when calculating this for a million-character string and trying to match 50 patterns (each between 1 and 1000 characters long), the program takes about 8.5 to 9.5 seconds to run.

However, if I try to implement the following function:

countBetween :: Ord a => [a] -> a -> Int -> Int -> (Int, Int)
countBetween list sym top bottom =
  let (topList, bottomList) = splitAt top list
      midList = take (bottom - top) bottomList
      getSyms = length . filter (== sym)
  in (getSyms topList, getSyms midList)

(with changes made to the matching function to compensate), the program takes between 18 and 22 seconds to run.

I've also tried passing in a Map which can keep track of previous calls, but that also takes about 20 seconds to run and runs up the memory usage.
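Roughly, the memoizing attempt was a lookup keyed on the symbol and position, something along these lines (a sketch - the exact key and the way the Map is threaded here are only illustrative):

import qualified Data.Map as M

-- Sketch of the memoizing idea: look up (sym, pos) before recounting.
countMemo :: Ord a => [a] -> a -> Int -> M.Map (a, Int) Int -> (Int, M.Map (a, Int) Int)
countMemo list sym pos memo =
  case M.lookup (sym, pos) memo of
    Just n  -> (n, memo)
    Nothing -> let n = length . filter (== sym) . take pos $ list
               in (n, M.insert (sym, pos) n memo)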

Similarly, I've shortened length . filter (== sym) to a fold, but again - 20 seconds for foldr, and 14-15 seconds for foldl.
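For reference, a sketch of such a fold-based count (the strict foldl' shown here is just one way to write it, not necessarily the exact fold I used):

import Data.List (foldl')

-- Count occurrences of sym in the first pos elements with a strict left fold.
countFold :: Eq a => [a] -> a -> Int -> Int
countFold list sym pos =
  foldl' (\acc x -> if x == sym then acc + 1 else acc) 0 (take pos list)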

So what would be a proper Haskell way to optimize this code through rewriting it? (Specifically, I'm looking for something that doesn't involve precomputation - I may not be reusing strings very much - and which explains something of why this is happening).

Edit: More clearly, what I am looking for is the following:

a) Why does this behaviour happen in Haskell? How does lazy evaluation play a role, what optimizations is the compiler making to rewrite the count and countBetween functions, and what other factors may be involved?

b) What is a simple code rewrite which would address this issue so that I don't traverse the lists multiple times? I'm looking specifically for something which addresses that issue, rather than a solution which sidesteps it. If the final answer is that count is the most efficient possible way to write the code, why is that?

Upvotes: 3

Views: 166

Answers (2)

ErikR

Reputation: 52057

I'm not sure lazy evaluation has much to do with the performance of the code. I think the main problem is the use of String - which is a linked list - instead of a more performant string type.

Note that this call in your countBetween function:

  let (topList, bottomList) = splitAt top list

will re-create the linked list corresponding to topList, meaning a lot more allocations.

A Criterion benchmark to compare splitAt versus using take n/drop n may be found here: http://lpaste.net/174526. The splitAt version is about 3 times slower and, of course, has a lot more allocations.
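A minimal benchmark in that spirit (not the one from the lpaste link - the list size and split point below are arbitrary placeholders) looks like this:

import Criterion.Main

xs :: [Int]
xs = [1 .. 1000000]

splitVersion, takeDropVersion :: Int -> Int
splitVersion n    = let (a, b) = splitAt n xs in length a + length b
takeDropVersion n = length (take n xs) + length (drop n xs)

main :: IO ()
main = defaultMain
  [ bench "splitAt"   $ whnf splitVersion 500000
  , bench "take/drop" $ whnf takeDropVersion 500000
  ]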

Even if you don't want to "pre-compute" the counts you can improve matters a great deal by simply switching to either ByteString or Text.

Define:

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS

countSyms :: Char -> ByteString -> Int -> Int -> Int
countSyms sym str lo hi =
  length [ i | i <- [lo..hi], BS.index str i == sym ]

and then:

countBetween :: ByteString -> Char -> Int -> Int -> (Int,Int)
countBetween str sym top bottom = (a,b)
  where a = countSyms sym str 0 (top-1)
        b = countSyms sym str top (bottom-1)

Also, don't use reverse on large lists - it will reallocate the entire list. Just index into a ByteString / Text in reverse.
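For example (a sketch, assuming the pattern is also a ByteString), you can produce the symbols last-to-first by index:

-- Walk a pattern right-to-left by index instead of building a reversed copy.
reversedSyms :: BS.ByteString -> [Char]
reversedSyms pat = [ BS.index pat i | i <- [BS.length pat - 1, BS.length pat - 2 .. 0] ]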

Memoizing counts may or may not help. It all depends on how it's done.

Upvotes: 1

ErikR

Reputation: 52057

It seems that the main point of the match routine is to transform an interval (bottom, top) into another interval based on the current symbol sym. The formulas are basically:

ref_fst = index of sym in ref_col
  -- defined in an outer scope

match :: Char -> (Int,Int) -> (Int,Int)
match sym (bottom, top) | bottom > top = (bottom, top)  -- the interval is already empty
match sym (bottom, top) =
  let 
    top_count = count of sym in str from index 0 to top
    bot_count = count of sym in str from index 0 to bottom
    mid_count = top_count - bot_count
  in if mid_count > 0
         then (ref_fst + bot_count, ref_fst + top_count)
         else (1,0)  -- the empty interval

And then matching is just a fold over pattern using match with the initial interval (0, n-1).

Both top_count and bot_count can be computed efficiently using a precomputed lookup table, and below is code which does that.

If you run test1 you'll see a trace of how the interval is transformed via each symbol in the pattern.

Note: There may be off-by-1 errors, and I've hard coded ref_fst to be 0 - I'm not sure how this fits into the larger algorithm, but the basic idea should be sound.

Note that once the counts vector has been created there is no need to index into the original string anymore. Therefore, even though I use a ByteString here for the (larger) DNA sequence, it's not crucial, and the mkCounts routine should work just as well if passed a String instead.
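For instance, a list-based variant could build the same running counts with a scanl (a sketch using the same tuple layout as below; entry i holds the counts over the first i characters):

-- Sketch of a list-based equivalent of mkCounts.
mkCountsList :: String -> [(Int,Int,Int,Int)]
mkCountsList = scanl step (0,0,0,0)
  where
    step (a,t,c,g) 'A' = (a+1,t,c,g)
    step (a,t,c,g) 'T' = (a,t+1,c,g)
    step (a,t,c,g) 'C' = (a,t,c+1,g)
    step (a,t,c,g) 'G' = (a,t,c,g+1)
    step x          _  = x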

Code also available at http://lpaste.net/174288

{-# LANGUAGE OverloadedStrings #-}

import Data.Vector.Unboxed ((!))
import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UVM
import qualified Data.ByteString.Char8 as BS
import Debug.Trace
import Text.Printf
import Data.List

mkCounts :: BS.ByteString -> UV.Vector (Int,Int,Int,Int)
mkCounts syms = UV.create $ do
  let n = BS.length syms
  v <- UVM.new (n+1)
  let loop x i | i >= n = return x
      loop x i = let s = BS.index syms i
                     (a,t,c,g) = x
                     x' = case s of
                            'A' -> (a+1,t,c,g)
                            'T' -> (a,t+1,c,g)
                            'C' -> (a,t,c+1,g)
                            'G' -> (a,t,c,g+1)
                            _   -> x
                 in do UVM.write v i x
                       loop x' (i+1) 
  x <- loop (0,0,0,0) 0
  UVM.write v n x
  return v

data DNA = A | C | T | G
  deriving (Show)

getter :: DNA -> (Int,Int,Int,Int) -> Int
getter A (a,_,_,_) = a
getter T (_,t,_,_) = t
getter C (_,_,c,_) = c
getter G (_,_,_,g) = g

-- narrow a window
narrow :: Int -> UV.Vector (Int,Int,Int,Int) -> DNA -> (Int,Int) ->  (Int,Int)

narrow refcol counts sym (lo,hi) | trace msg False = undefined
  where msg = printf "-- lo: %d  hi: %d  refcol: %d  sym: %s  top_cnt: %d  bot_count: %d" lo hi refcol (show sym) top_count bot_count
        top_count = getter sym (counts ! (hi+1))
        bot_count = getter sym (counts ! lo)

narrow refcol counts sym (lo,hi) =
  let top_count = getter sym (counts ! (hi+1))
      bot_count = getter sym (counts ! (lo+0))
      mid_count = top_count - bot_count
  in if mid_count > 0
       then ( refcol + bot_count, refcol + top_count-1 )
       else (lo+1,lo)  -- signal an empty window

findFirst :: DNA -> UV.Vector (Int,Int,Int,Int)  -> Int
findFirst sym v =
  let n = UV.length v
      loop i | i >= n = n
      loop i = if getter sym (v ! i) > 0
                 then i
                 else loop (i+1)
  in loop 0

toDNA :: String -> [DNA]
toDNA str = map charToDNA str

charToDNA :: Char -> DNA
charToDNA = go
  where go 'A' = A
        go 'C' = C
        go 'T' = T
        go 'G' = G

dnaToChar A = 'A'
dnaToChar C = 'C'
dnaToChar T = 'T'
dnaToChar G = 'G'

first :: DNA -> BS.ByteString -> Int
first sym str = maybe len id (BS.elemIndex (dnaToChar sym) str)
  where len = BS.length str

test2 = do
 -- matching "AT$TCTAGT" "$AACGTTTT" "TCG"
  let str    = "AT$TCTAGT"
      refcol = "$AACGTTTT"
      syms   = toDNA "TCG"

      -- hard coded for now
      -- may be computed and memoized
      refcol_G = 4
      refcol_C = 3
      refcol_T = 5

      counts = mkCounts str
      w0 = (0, BS.length str -1)

      w1 = narrow refcol_G counts G w0
      w2 = narrow refcol_C counts C w1
      w3 = narrow refcol_T counts T w2

      firsts = (first A refcol, first T refcol, first C refcol, first G refcol)

  putStrLn $ "firsts: " ++ show firsts

  putStrLn $ "w0: " ++ show w0
  putStrLn $ "w1: " ++ show w1
  putStrLn $ "w2: " ++ show w2
  putStrLn $ "w3: " ++ show w3
  let (lo,hi) = w3
      len = if lo <= hi then hi - lo + 1 else 0
  putStrLn $ "length: " ++ show len

matching :: BS.ByteString -> BS.ByteString -> String -> Int
matching  str refcol pattern = 
  let counts = mkCounts str
      n = BS.length str
      syms = toDNA (reverse pattern)
      firsts = (first A refcol, first T refcol, first C refcol, first G refcol)

      go (lo,hi) sym = narrow refcol counts sym (lo,hi)
        where refcol = getter sym firsts

      (lo, hi) = foldl' go (0,n-1) syms
      len = if lo <= hi then hi - lo + 1 else 0
  in len

test3 = matching "AT$TCTAGT" "$AACGTTTT" "TCG"

Upvotes: 1
