Reputation: 948
I'm trying to read and analyse a huge CSV file. I used Data.Csv.Streaming
from cassava, and functions are applied in the following order:
Data.ByteString.Lazy.readFile -- Gives lazy stream
Data.Csv.Streaming.decodeByname -- Gives Either String (Header Records t)
\(Right (_, v)) -> v -- Gives right side of either (Records t)
Data.Foldable.toList -- Gives [t]
After this the program enters the analysis stage, and executes four (this is very important) different instances (i.e. with different filters) of the following
filter -- Result of toList is applied through a filter
map
Data.Foldable.foldl' -- Does bin counting using a map. The map has at most 60 keys.
However, it appears that the program takes up a huge amount of memory while attempting to load the entire CSV file.
If I only have one instance of foldl' executing, the program does a nice single pass through the CSV data and doesn't consume as much memory. Is there a way to fuse the foldl's together? That is, having
x = foldl' f Map.empty $ filter cx li
y = foldl' f Map.empty $ filter cy li
...
and force it to execute in single pass.
Edit: The following function is used in foldl
with Data.Map.Strict
as Map
:
bincollect :: Ord a => Num b => Map.Map a b -> a -> Map.Map a b
bincollect !m !key = Map.insertWith (+) key 1 m
and the foldl begins with an empty map.
The memory usage grows with the number of elements take
d with or without optimization on.
Upvotes: 2
Views: 127
Reputation: 10091
Yes, you can indeed fuse the four folds together, but you'll have to do it manually. You could try and write out the logic yourself, or you could use a library (like foldl) to help. For instance, you can turn your bincollect into a fold:
bincollect :: (Ord a, Num b) => Fold a (Map.Map a b)
bincollect = Fold (\m key -> Map.insertWith (+) key 1 m) Map.empty id
Then, you can filter using prefilter
:
x = prefilter cx bincollect
Finally, you can combine them together using the Applicative
instance:
(w,x,y,z) = fold ((,,,) <$> prefilter cw bincollect
<*> prefilter cx bincollect
<*> prefilter cy bincollect
<*> prefilter cz bincollect)
input
Upvotes: 2