nponeccop

Reputation: 13677

Counting different values using Data.Map leaks memory

The following program uses 100+ MB of RAM to count the different line lengths in a 250 MB file. How do I fix it to use less RAM? I suppose I have misused lazy IO, foldr, and the laziness of Data.Map values.

import Control.Applicative
import qualified Data.Map as M
import Data.List

main = do
  content <- readFile "output.csv"
  print $ (foldr count M.empty . map length . lines) content

count a b = M.insertWith (+) a 1 b

Upvotes: 3

Views: 359

Answers (2)

Daniel Fischer

Reputation: 183968

The first big mistake in

main = do
  content <- readFile "output.csv"
  print $ (foldr count M.empty . map length . lines) content

count a b = M.insertWith (+) a 1 b

is using foldr. That constructs an expression of the form

length firstLine `count` length secondLine `count` ... `count` length lastLine `count` M.empty

traversing the entire list of lines to construct that thunk (at that point not even evaluating the length calls, due to laziness) before it is finally evaluated right to left. So the entire file contents are held in memory, in addition to the thunk for building the Map.

If you build up a map from a list of things, always use a strict left fold (well, if the list is short and the things not huge, it doesn't matter), unless the semantics require a right fold. That might be the case if you're combining values with a non-commutative function, but even then it is usually preferable to use a left fold and reverse the list before building the map, as sketched below.
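A minimal sketch of that reversal trick (ins, collectR, and collectL are made-up names for illustration): with a non-commutative combining function such as (++), a strict left fold over the reversed list builds exactly the same map a right fold would, without the giant thunk.

import Data.List (foldl')
import qualified Data.Map as M

-- ins prepends the new values; insertion order matters,
-- because (++) is not commutative.
ins :: Ord k => (k, [v]) -> M.Map k [v] -> M.Map k [v]
ins (k, vs) = M.insertWith (++) k vs

collectR, collectL :: Ord k => [(k, [v])] -> M.Map k [v]
collectR = foldr ins M.empty                    -- right fold: big thunk on big input
collectL = foldl' (flip ins) M.empty . reverse  -- strict left fold, same result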

Data.Maps (or Data.IntMaps) are spine-strict; that alone makes it impossible to produce partial output before the entire list has been traversed, so the strengths of foldr cannot be exploited here.
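Spine-strictness is easy to observe in GHCi (an illustrative session): the tree and its keys are evaluated eagerly, while the values stay unevaluated thunks.

ghci> import qualified Data.Map as M
ghci> let m = M.fromList [(1, undefined), (2, undefined)] :: M.Map Int Int
ghci> M.size m    -- forces the spine and the keys, but not the values
2
ghci> M.keys m
[1,2]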

The next (possible) problem, again caused by laziness, is that you don't evaluate the mapped-to values when putting them into the Map, so if some line length occurs particularly often, its count becomes a huge thunk

((...((1+1)+1)...+1)+1)

Make it

main = do
  content <- readFile "output.csv"
  print $ (foldl' count M.empty . map length . lines) content

count mp a = M.insertWith' (+) a 1 mp

so that the lines can be garbage collected as soon as they have been read in, and no thunks can build up in the values. That way you never need more than one line of the file in memory at once, and even that need not be in memory entirely, since the length is evaluated before it is recorded in the Map.
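If insertWith' should be unavailable for some reason, the same value-strictness can be had by forcing the new count by hand before inserting it. A sketch (countStrict is a made-up name; it traverses the map twice, but builds no value thunks):

import qualified Data.Map as M

-- Look up the old count, force the incremented value with seq,
-- then insert it, replacing any previous entry.
countStrict :: M.Map Int Int -> Int -> M.Map Int Int
countStrict mp a = let n = maybe 1 (+1) (M.lookup a mp)
                   in n `seq` M.insert a n mp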

If your containers package is recent enough, you could also

import Data.Map.Strict

and leave count using insertWith (without the prime; the Data.Map.Strict module always evaluates the values put into the map).
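Spelled out, the strict-API variant of the whole program might look like this (a sketch; it assumes a containers version that ships Data.Map.Strict):

import qualified Data.Map.Strict as M
import Data.List (foldl')

main :: IO ()
main = do
  content <- readFile "output.csv"
  print $ (foldl' count M.empty . map length . lines) content

-- insertWith from Data.Map.Strict evaluates the combined value,
-- so the counts never pile up as (...((1+1)+1)...) thunks.
count :: M.Map Int Int -> Int -> M.Map Int Int
count mp a = M.insertWith (+) a 1 mp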

Upvotes: 5

seliopou

Reputation: 2916

One way to get the maximum residency down is to use IntMap instead of Map; it is a version of the Map data structure specialized to Int keys. It's a simple change:

import Control.Applicative
import qualified Data.IntMap as I
import Data.List

main = do
  content <- readFile "output.csv"
  print $ (foldr count I.empty . map length . lines) content

count a b = I.insertWith (+) a 1 b

Comparing this version against yours on /usr/share/dict/words, maximum residency went from about 100 MB down to 60 MB. Note that this was without any optimization flags; if you turn those on, maximum residency will very likely improve further.
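If you want to reproduce such measurements, GHC's runtime statistics include the maximum residency. Something along these lines (counts.hs is a placeholder file name):

$ ghc -O2 -rtsopts counts.hs
$ ./counts +RTS -s

The report printed by -s contains a "maximum residency" line, which is the figure being compared here.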

Upvotes: 0
