Reputation: 13677
The following program uses 100+ MB of RAM when counting the different line lengths in a 250 MB file. How do I fix it to use less RAM? I suppose I misused lazy IO, foldr, and the laziness of Data.Map values.
import Control.Applicative
import qualified Data.Map as M
import Data.List
main = do
  content <- readFile "output.csv"
  print $ (foldr count M.empty . map length . lines) content

count a b = M.insertWith (+) a 1 b
Upvotes: 3
Views: 359
Reputation: 183968
The first big mistake in

main = do
  content <- readFile "output.csv"
  print $ (foldr count M.empty . map length . lines) content

count a b = M.insertWith (+) a 1 b

is using foldr. That constructs an expression of the form

length firstLine `count` (length secondLine `count` (... `count` (length lastLine `count` M.empty)))

traversing the entire list of lines to build that thunk - due to laziness, without even evaluating the length calls - before it is then evaluated right to left. So the entire file contents are in memory, in addition to the thunk for building the Map.
If you build up a map from a list of things, always use a strict left fold (well, if the list is short and the things not huge, it doesn't matter) unless the semantics require a right fold. That might be the case if you're combining values with a non-commutative function, but even then it is often preferable to use a left fold and reverse the list before building the map, as the sketch below illustrates.
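A minimal sketch of that guideline (the helper names buildMap and buildMapR are hypothetical, not from the answer):

import Data.List (foldl')
import qualified Data.Map as M

-- Build a map with a strict left fold; the default choice.
buildMap :: Ord k => (v -> v -> v) -> [(k, v)] -> M.Map k v
buildMap f = foldl' (\m (k, v) -> M.insertWith f k v m) M.empty

-- If f is non-commutative and the right-fold combining order is
-- required, reverse the input first: a left fold over the reversed
-- list combines duplicate keys in exactly the order foldr would.
buildMapR :: Ord k => (v -> v -> v) -> [(k, v)] -> M.Map k v
buildMapR f = buildMap f . reverse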
Data.Map (and likewise Data.IntMap) is spine-strict; that alone makes it impossible to produce partial output before the entire list has been traversed, so the strengths of foldr cannot be used here.
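To see the contrast, foldr can stream into a structure with a lazy spine, such as a list, but not into a Map - a small illustration, not from the answer:

import qualified Data.Map as M

-- A list's spine is lazy, so foldr can yield partial output,
-- even for infinite input:
streamed :: [Int]
streamed = take 3 (foldr (:) [] [1 ..])   -- [1,2,3], terminates

-- A Map's spine is strict: none of the map exists until the whole
-- input has been consumed, so this would diverge if uncommented:
-- diverges = M.size (foldr (\k m -> M.insert k () m) M.empty [1 ..])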
The next (possible) problem is - again laziness - that you don't evaluate the mapped-to values when putting them into the Map, so if some line length occurs particularly often, its value becomes a huge thunk:

((...((1+1)+1)...+1)+1)
Make it
main = do
  content <- readFile "output.csv"
  print $ (foldl' count M.empty . map length . lines) content

count mp a = M.insertWith' (+) a 1 mp
so that the lines can be garbage collected as soon as they have been read in, and no thunks can build up in the values. That way you never need more than one line of the file in memory at once, and even that need not be in memory entirely, since the length is evaluated before it is recorded in the Map.
If your containers package is recent enough, you could also import Data.Map.Strict and leave count using insertWith (without the prime; the Data.Map.Strict module always evaluates values put into the map).
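For concreteness, a sketch of that variant, assuming containers >= 0.5, where Data.Map.Strict was introduced:

import Data.List (foldl')
import qualified Data.Map.Strict as M

main :: IO ()
main = do
  content <- readFile "output.csv"
  print $ (foldl' count M.empty . map length . lines) content

-- insertWith from Data.Map.Strict forces the combined value,
-- so no (1+1+...) thunks accumulate in the map.
count :: M.Map Int Int -> Int -> M.Map Int Int
count mp a = M.insertWith (+) a 1 mp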
Upvotes: 5
Reputation: 2916
One way to get max residency down is to use IntMap instead of Map; IntMap is a version of the Map data structure specialized to Int keys. It's a simple change:
import Control.Applicative
import qualified Data.IntMap as I
import Data.List
main = do
  content <- readFile "output.csv"
  print $ (foldr count I.empty . map length . lines) content

count a b = I.insertWith (+) a 1 b
Comparing this version against yours on /usr/share/dict/words, I saw max residency go from about 100 MB down to about 60 MB. Note that this was without any optimization flags; if you crank those up, max residency will very likely improve further.
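For what it's worth, the IntMap change composes with the strict left fold from the answer above; a sketch, assuming containers >= 0.5 so that Data.IntMap.Strict is available:

import Data.List (foldl')
import qualified Data.IntMap.Strict as I

main :: IO ()
main = do
  content <- readFile "output.csv"
  print $ (foldl' count I.empty . map length . lines) content

-- Strictly accumulate a count per line length.
count :: I.IntMap Int -> Int -> I.IntMap Int
count mp a = I.insertWith (+) a 1 mp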
Upvotes: 0