Reputation: 8136
I'm writing a program that creates a shell script containing one command for each image file in a directory. There are 667,944 images in the directory, so I need to handle the strictness/laziness issue properly.
Here's a simple example that gives me Stack space overflow
. It does work if I give it more space using +RTS -Ksize -RTS
, but it should be able run with little memory, producing output immediately. So I've been reading the stuff about strictness in the Haskell wiki and the wikibook on Haskell, trying to figure out how to fix the problem, and I think it's one of the mapM commands that is giving me grief, but I still don't understand enough about strictness to sort the problem.
I've found some other questions on SO that seem relevant (Is mapM in Haskell strict? Why does this program get a stack overflow? and Is Haskell's mapM not lazy?), but enlightenment still eludes me.
import System.Environment (getArgs)
import System.Directory (getDirectoryContents)
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
main :: IO ()
main = do
putStrLn "#!/bin/sh"
(indir:outdir:_) <- getArgs
files <- getDirectoryContents indir
let imageFiles = filter (`notElem` [".", ".."]) files
commands <- mapM (genCommand indir outdir) imageFiles
mapM_ putStrLn commands
EDIT: TEST #1
Here's the newest version of the example.
import System.Environment (getArgs)
import System.Directory (getDirectoryContents)
import Control.Monad ((>=>))
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
main :: IO ()
main = do
putStrLn "TEST 1"
(indir:outdir:_) <- getArgs
files <- getDirectoryContents indir
putStrLn $ show (length files)
let imageFiles = filter (`notElem` [".", ".."]) files
-- mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles
I compile it with the command ghc --make -O2 amy2.hs -rtsopts
. If I run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat
, I get
TEST 1
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
If I instead run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat +RTS -K20M
, I get the correct output...eventually:
TEST 1
667946
convert /home/amy/nosync/GalaxyZoo/table2/images//587736546846572812.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736546846572812.jpeg
convert /home/amy/nosync/GalaxyZoo/table2/images//587736542558617814.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736542558617814.jpeg
...and so on.
Upvotes: 3
Views: 360
Reputation: 24802
This isn't really a strictness issue(*), but an order of evaluation issue. Unlike lazily evaluated pure values, monadic effects must happen in deterministic order. mapM
executes every action in the given list and gathers the results, but it cannot return until the whole list of actions is executed, so you don't get the same streaming behavior as with pure list functions.
The easy fix in this case is to run both genCommand
and putStrLn
inside the same mapM_
. Note that mapM_
doesn't suffer from the same issue since it is not building an intermediate list.
mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
The above uses the "kleisli composition operator" >=>
from Control.Monad
which is like the function composition operator .
except for monadic functions. You can also use the normal bind and a lambda.
mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles
For more complex I/O applications where you want better composability between small, monadic stream processors, you should use a library such as conduit
or pipes
.
Also, make sure you are compiling with either -O
or -O2
.
(*) To be exact, it is also a strictness issue, because in addition to building a large, intermediate list in memory, laziness causes mapM
to build unnecessary thunks and use up stack.
EDIT: So it seems the main culprit might be getDirectoryContents
. Looking at the function's source code, it essentially does the same kind of list accumulation internally as mapM
.
In order to do streaming directory listing, we need to use System.Posix.Directory
which unfortunately makes the program incompatible with non-POSIX systems (like Windows). You can stream the directory contents by e.g. using continuation passing style
import System.Environment (getArgs)
import Control.Monad ((>=>))
import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
import Control.Exception (bracket)
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
streamingDirContents :: FilePath -> (FilePath -> IO ()) -> IO ()
streamingDirContents root cont = do
let loop stream = do
fp <- readDirStream stream
case fp of
[] -> return ()
_ | fp `notElem` [".", ".."] -> cont fp >> loop stream
| otherwise -> loop stream
bracket (openDirStream root) loop closeDirStream
main :: IO ()
main = do
putStrLn "TEST 1"
(indir:outdir:_) <- getArgs
streamingDirContents indir (genCommand indir outdir >=> putStrLn)
Here's how you could do the same thing using conduit
:
import System.Environment (getArgs)
import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
import Data.Conduit
import qualified Data.Conduit.List as L
import Control.Monad.IO.Class (liftIO, MonadIO)
genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
let infile = indir ++ '/':file
let angle = 0 -- have to actually read the file to calculate this for real
let outfile = outdir ++ '/':file
return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
" -crop 143x143+140+140 " ++ outfile
dirSource :: (MonadResource m, MonadIO m) => FilePath -> Source m FilePath
dirSource root = do
bracketP (openDirStream root) closeDirStream $ \stream -> do
let loop = do
fp <- liftIO $ readDirStream stream
case fp of
[] -> return ()
_ -> yield fp >> loop
loop
main :: IO ()
main = do
putStrLn "TEST 1"
(indir:outdir:_) <- getArgs
let files = dirSource indir $= L.filter (`notElem` [".", ".."])
commands = files $= L.mapM (liftIO . genCommand indir outdir)
runResourceT $ commands $$ L.mapM_ (liftIO . putStrLn)
The nice thing about conduit
is that you regain the ability to compose pieces of functionality with things like conduit versions of filter
and mapM
. The $=
operator streams stuff forward in the chain and $$
connects the stream to a consumer.
The not-so-nice thing is that real world is complicated and writing efficient and robust code requires us to jump through some hoops with resource management. That's why all the operations work in the ResourceT
monad transformer which keeps track of e.g. open file handles and cleans them up promptly and deterministically when they are no longer needed or e.g. if the computation gets aborted by an exception (this is in contrast to using lazy I/O and relying on the garbage collector to eventually release any scarce resources).
However, this means that we a) need to run the final resulting conduit operation with runResourceT
and b) we need to explicitly lift I/O operations to the transformed monad using liftIO
instead of being able to directly write e.g. L.mapM_ putStrLn
.
Upvotes: 6