nponeccop

Reputation: 13677

Counting different values using Data.Map leaks memory

The following program uses 100+ MB of RAM to count the different line lengths in a 250 MB file. How do I fix it to use less RAM? I suppose I have misused lazy IO, foldr, and the laziness of Data.Map values.

import Control.Applicative
import qualified Data.Map as M
import Data.List

main = do
  content <- readFile "output.csv"
  print $ (foldr count M.empty . map length . lines) content

count a b = M.insertWith (+) a 1 b

Upvotes: 3

Views: 359

Answers (2)

Daniel Fischer

Reputation: 183968

The first big mistake in

main = do
  content <- readFile "output.csv"
  print $ (foldr count M.empty . map length . lines) content

count a b = M.insertWith (+) a 1 b

is using foldr. That constructs an expression of the form

length firstLine `count` length secondLine `count` ... `count` length lastLine `count` M.empty

traversing the entire list of lines to construct that thunk (at that point not even evaluating the length calls, due to laziness) before it is finally evaluated right to left. So the entire file contents are held in memory, in addition to the thunk for building the Map.

If you build up a map from a list of things, always use a strict left fold (well, if the list is short and the things not huge, it doesn't matter), unless the semantics require a right fold. That might be the case if you're combining values with a non-commutative function, but even then it is usually preferable to use a left fold and reverse the list before building the map, as sketched below.
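A minimal sketch of that reversal trick (ins, collectR, and collectL are made-up names for illustration): with a non-commutative combining function such as (++), a strict left fold over the reversed list builds exactly the same map a right fold would, without the giant thunk.

import Data.List (foldl')
import qualified Data.Map as M

-- ins prepends the new values; insertion order matters,
-- because (++) is not commutative.
ins :: Ord k => (k, [v]) -> M.Map k [v] -> M.Map k [v]
ins (k, vs) = M.insertWith (++) k vs

collectR, collectL :: Ord k => [(k, [v])] -> M.Map k [v]
collectR = foldr ins M.empty                    -- right fold: big thunk on big input
collectL = foldl' (flip ins) M.empty . reverse  -- strict left fold, same result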

Data.Maps (or Data.IntMaps) are spine-strict; that alone makes it impossible to produce partial output before the entire list has been traversed, so the strengths of foldr cannot be exploited here.
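Spine-strictness is easy to observe in GHCi (an illustrative session): the tree and its keys are evaluated eagerly, while the values stay unevaluated thunks.

ghci> import qualified Data.Map as M
ghci> let m = M.fromList [(1, undefined), (2, undefined)] :: M.Map Int Int
ghci> M.size m    -- forces the spine and the keys, but not the values
2
ghci> M.keys m
[1,2]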

The next (possible) problem, again caused by laziness, is that you don't evaluate the mapped-to values when putting them into the Map, so if some line length occurs particularly often, its count becomes a huge thunk

((...((1+1)+1)...+1)+1)

Make it

main = do
  content <- readFile "output.csv"
  print $ (foldl' count M.empty . map length . lines) content

count mp a = M.insertWith' (+) a 1 mp

so that the lines can be garbage collected as soon as they have been read in, and no thunks can build up in the values. That way you never need more than one line of the file in memory at once, and even that need not be in memory entirely, since the length is evaluated before it is recorded in the Map.
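If insertWith' should be unavailable for some reason, the same value-strictness can be had by forcing the new count by hand before inserting it. A sketch (countStrict is a made-up name; it traverses the map twice, but builds no value thunks):

import qualified Data.Map as M

-- Look up the old count, force the incremented value with seq,
-- then insert it, replacing any previous entry.
countStrict :: M.Map Int Int -> Int -> M.Map Int Int
countStrict mp a = let n = maybe 1 (+1) (M.lookup a mp)
                   in n `seq` M.insert a n mp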

If your containers package is recent enough, you could also

import Data.Map.Strict

and leave count using insertWith (without the prime; the Data.Map.Strict module always evaluates the values put into the map).
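Spelled out, the strict-API variant of the whole program might look like this (a sketch; it assumes a containers version that ships Data.Map.Strict):

import qualified Data.Map.Strict as M
import Data.List (foldl')

main :: IO ()
main = do
  content <- readFile "output.csv"
  print $ (foldl' count M.empty . map length . lines) content

-- insertWith from Data.Map.Strict evaluates the combined value,
-- so the counts never pile up as (...((1+1)+1)...) thunks.
count :: M.Map Int Int -> Int -> M.Map Int Int
count mp a = M.insertWith (+) a 1 mp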

Upvotes: 5

seliopou

Reputation: 2916

One way to get the maximum residency down is to use IntMap instead of Map; it is a version of the Map data structure specialized to Int keys. It's a simple change:

import Control.Applicative
import qualified Data.IntMap as I
import Data.List

main = do
  content <- readFile "output.csv"
  print $ (foldr count I.empty . map length . lines) content

count a b = I.insertWith (+) a 1 b

Comparing this version against yours on /usr/share/dict/words, maximum residency went from about 100 MB down to 60 MB. Note that this was without any optimization flags; if you turn those on, maximum residency will very likely improve further.
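If you want to reproduce such measurements, GHC's runtime statistics include the maximum residency. Something along these lines (counts.hs is a placeholder file name):

$ ghc -O2 -rtsopts counts.hs
$ ./counts +RTS -s

The report printed by -s contains a "maximum residency" line, which is the figure being compared here.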

Upvotes: 0
