Free_D

Reputation: 577

Where is the space leak in this code?

I am trying to split a file into two separate files by alternating lines. (i.e. lines 1,3,5,7.. written to file 1 and lines 2,4,6,8... written to file 2).

The file I am working with is ~700MB, so when I saw the memory usage balloon to over 6GB, I knew something was wrong.

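import Data.List (foldl')
import System.IO

-- splitFile, testFile and trainingFile are FilePath values (definitions not shown).
main :: IO ()
main = withFile splitFile ReadMode splitData
  where
    splitData h = do
      dataSet <- lines <$> hGetContents h
      let (s1,s2) = foldl' (\(l,r) x -> (x:r,l)) ([],[]) dataSet
      writeFile testFile $ unlines s1
      writeFile trainingFile $ unlines s2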

I was initially using the lazy foldl, but after some research it seemed that the strict version, foldl', would help. Alas, it made no noticeable difference. I also tried compiling with -O2, but that did nothing either.

I am using GHC 7.10.2

I'm not getting a stack overflow, so what is it using all that memory for?

Upvotes: 2

Views: 104

Answers (1)

user2407038

Reputation: 14578

As @dfeuer mentioned in a comment, writeFile forces the entire string it is given, and computing that string forces the entire input to be read. The space leak arises because the whole second file must be kept in memory while the first file is being written, even though only one line at a time actually needs to be in memory. The solution is to write line by line:

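import Control.Monad
import System.IO

main :: IO ()
main =
  withFile splitFile ReadMode $ \hIn ->
  withFile testFile WriteMode $ \hOdd ->
  withFile trainingFile WriteMode $ \hEven ->
  -- Pair each input line with an alternating output handle and write it
  -- immediately, so lines are consumed lazily one at a time.
  zipWithM_ hPutStrLn (cycle [hOdd, hEven]) . lines =<< hGetContents hIn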

This program runs in constant space.
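For comparison, the same line-by-line idea can be written without lazy I/O, as an explicit loop over hGetLine. This is only a sketch: the helper name splitAlternating and its three FilePath parameters are assumptions made for illustration and do not appear in the code above.

```haskell
import Control.Monad (unless)
import System.IO

-- Hypothetical helper (name and signature are illustrative, not from the
-- answer): copies lines 1,3,5,... to oddPath and lines 2,4,6,... to evenPath.
splitAlternating :: FilePath -> FilePath -> FilePath -> IO ()
splitAlternating inPath oddPath evenPath =
  withFile inPath ReadMode $ \hIn ->
  withFile oddPath WriteMode $ \hOdd ->
  withFile evenPath WriteMode $ \hEven ->
    let -- cur receives the next line; the two handles swap on every step.
        loop cur next = do
          eof <- hIsEOF hIn
          unless eof $ do
            line <- hGetLine hIn
            hPutStrLn cur line
            loop next cur
    in loop hOdd hEven
```

Called as splitAlternating splitFile testFile trainingFile, this should hold only the current line in memory at any point, since each line is written out before the next one is read.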

Upvotes: 7
