Henricus V.

Reputation: 948

Fusing multiple foldl' in Haskell

I'm trying to read and analyse a huge CSV file. I use Data.Csv.Streaming from cassava, and the functions are applied in the following order:

Data.ByteString.Lazy.readFile -- Gives lazy stream
Data.Csv.Streaming.decodeByName -- Gives Either String (Header, Records t)
\(Right (_, v)) -> v -- Gives right side of either (Records t)
Data.Foldable.toList -- Gives [t]

After this, the program enters the analysis stage and executes four (this is very important) different instances (i.e. with different filters) of the following pipeline:

filter -- The result of toList is passed through a filter
map
Data.Foldable.foldl' -- Does bin counting using a map. The map has at most 60 keys.

However, it appears that the program takes up a huge amount of memory while attempting to load the entire CSV file.

If only one instance of foldl' executes, the program does a nice single pass through the CSV data and doesn't consume as much memory. Is there a way to fuse the foldl' calls together? That is, can I take

x = foldl' f Map.empty $ filter cx li
y = foldl' f Map.empty $ filter cy li
...

and force them to execute in a single pass?

Edit: The following function is used in foldl', with Data.Map.Strict imported as Map:

bincollect :: (Ord a, Num b) => Map.Map a b -> a -> Map.Map a b
bincollect !m !key = Map.insertWith (+) key 1 m

and each foldl' begins with an empty map.

Memory usage grows with the number of elements taken, whether or not optimization is on.

Upvotes: 2

Views: 127

Answers (1)

oisdk

Reputation: 10091

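Yes, you can indeed fuse the four folds together, but you'll have to do it manually. The memory blow-up comes from sharing: all four folds consume the same list, so while the first fold forces it, the others still hold a reference to its head and the garbage collector cannot free any of it. You could try to write out the logic yourself, or you could use a library (like foldl) to help. For instance, you can turn your bincollect into a fold: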

bincollect :: (Ord a, Num b) => Fold a (Map.Map a b)
bincollect = Fold (\m key -> Map.insertWith (+) key 1 m) Map.empty id

Then, you can filter using prefilter:

x = prefilter cx bincollect

Finally, you can combine them together using the Applicative instance:

(w,x,y,z) = fold ((,,,) <$> prefilter cw bincollect
                        <*> prefilter cx bincollect
                        <*> prefilter cy bincollect
                        <*> prefilter cz bincollect)
                 input
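
If you would rather write the logic out yourself without the foldl library, one possible shape is a single foldl' over a strict accumulator holding all four maps. This is only a sketch under some assumptions: the names Acc, stepIf and fusedCounts are made up for illustration, the counts are fixed to Int, the predicates in main are placeholders for cw, cx, cy and cz, and the map step between filter and foldl' is left out.

```haskell
module Main where

import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical strict accumulator: the strict fields force all four maps
-- at every step, so no chain of thunks builds up during the fold.
data Acc k = Acc !(Map.Map k Int) !(Map.Map k Int) !(Map.Map k Int) !(Map.Map k Int)

-- Same counting step as in the question, specialised to Int counts.
bincollect :: Ord a => Map.Map a Int -> a -> Map.Map a Int
bincollect m key = Map.insertWith (+) key 1 m

-- Count the element only when the predicate accepts it.
stepIf :: Ord a => (a -> Bool) -> Map.Map a Int -> a -> Map.Map a Int
stepIf p m x
  | p x       = bincollect m x
  | otherwise = m

-- One pass over the input, updating all four maps per element.
fusedCounts
  :: Ord a
  => (a -> Bool) -> (a -> Bool) -> (a -> Bool) -> (a -> Bool)
  -> [a]
  -> (Map.Map a Int, Map.Map a Int, Map.Map a Int, Map.Map a Int)
fusedCounts cw cx cy cz =
  unpack . foldl' step (Acc Map.empty Map.empty Map.empty Map.empty)
  where
    step (Acc w x y z) a =
      Acc (stepIf cw w a) (stepIf cx x a) (stepIf cy y a) (stepIf cz z a)
    unpack (Acc w x y z) = (w, x, y, z)

main :: IO ()
main = do
  let input = [1 .. 10] :: [Int]  -- stand-in for the list of CSV records
      (w, x, y, z) = fusedCounts even odd (> 5) (< 3) input
  print (Map.size w, Map.size x, Map.size y, Map.size z)  -- prints (5,5,5,2)
```

Because the list is consumed by exactly one fold, each element can be collected as soon as the step function is done with it, provided nothing else keeps a reference to the list.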

Upvotes: 2
