Reputation: 577
I am trying to split a file into two separate files by alternating lines (i.e., lines 1, 3, 5, 7, ... are written to file 1 and lines 2, 4, 6, 8, ... are written to file 2).
The file I am working with is ~700MB, so when I saw the memory usage balloon to over 6GB, I knew something was wrong.
    main :: IO()
    main = withFile splitFile ReadMode splitData
      where
        splitData h = do
          dataSet <- lines <$> hGetContents h
          let (s1,s2) = foldl' (\(l,r) x -> (x:r,l)) ([],[]) dataSet
          writeFile testFile $ unlines s1
          writeFile trainingFile $ unlines s2
I was initially using the lazy foldl, but after some research it seemed that the strict version, foldl', would help. It made no noticeable difference. Compiling with -O2 did not help either.
I am using GHC 7.10.2.
I'm not getting a stack overflow, so what is it using all that memory for?
Upvotes: 2
Views: 104
Reputation: 14578
As mentioned in a comment by @dfeuer, writeFile forces the entire string it writes to be computed, which in turn forces the entire input to be read. Switching to foldl' does not change this, because a left fold has to consume the whole input list before it returns either result list. The space leak arises because the entire second file must be kept in memory while the first file is being written, even though only one line needs to be in memory at a time. The solution is to write line by line:
    import Control.Monad
    import System.IO

    main :: IO ()
    main =
      withFile splitFile ReadMode $ \hIn ->
      withFile testFile WriteMode $ \hOdd ->
      withFile trainingFile WriteMode $ \hEven ->
        zipWithM_ hPutStrLn (cycle [hOdd, hEven]) . lines =<< hGetContents hIn
This program runs in constant space.
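For comparison, here is a minimal sketch of the same idea written as an explicit loop, which avoids lazy I/O (hGetContents) altogether. It is not part of the original answer: the helper name splitAlternating and the file names "input.txt", "odd.txt" and "even.txt" are placeholders chosen for illustration. Like the version above, it holds only one line in memory at a time, and it keeps the lines in their original order (the fold in the question builds each list in reverse).

```haskell
import Control.Monad (unless)
import System.IO

-- Hypothetical helper: copy lines from hIn alternately to two handles.
-- The first line goes to hFirst, the second to hSecond, and so on.
splitAlternating :: Handle -> Handle -> Handle -> IO ()
splitAlternating hIn hFirst hSecond = go hFirst hSecond
  where
    -- 'cur' receives the next line; the two handles swap roles each step.
    go cur next = do
      eof <- hIsEOF hIn
      unless eof $ do
        line <- hGetLine hIn      -- reads exactly one line
        hPutStrLn cur line        -- writes it out immediately
        go next cur

main :: IO ()
main =
  -- File names are placeholders; substitute your own paths.
  withFile "input.txt" ReadMode $ \hIn ->
  withFile "odd.txt" WriteMode $ \hOdd ->
  withFile "even.txt" WriteMode $ \hEven ->
    splitAlternating hIn hOdd hEven
```

The trade-off is mostly stylistic: the zipWithM_ version is shorter, while the explicit loop makes it obvious that each line is written before the next one is read.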
Upvotes: 7