sdasdadas
sdasdadas

Reputation: 25096

How do I reduce garbage collection when parsing a binary file in Haskell?

I am writing a program to parse .TGA files using Haskell - however, the performance is absolutely dreadful. A 2048 x 2048 pixel image takes a few seconds to parse.

I ran my code using +RTS -p -RTS and received the following interesting pieces from the report:

total time = 1.08 secs
total alloc = 3,037,568,120 bytes

COST CENTRE           MODULE    %time    %alloc
readPixelMap            Main     33.0      11.0
readPixelMap.getByte    Main     32.7      75.1
readPixelMap.getColor   Main     27.0      13.3

It appears that my program is allocating a huge amount of memory in the readPixelMap function. That function looks like this:

readPixelMap width height = do
    pixels <- replicateM height (replicateM width getColor)
    return $ PixMap pixels
    where getByte = liftM toInteger getWord8
          getColor = do (red:green:blue:alpha:xs) <- replicateM 4 getByte
                        return (red, green, blue, alpha)

and a PixMap is defined as

data PixMap = PixMap [[RGBAColor]] deriving (Show)
type RGBAColor = (Integer, Integer, Integer, Integer)

readPixelMap is called using the Get monad from Data.Binary:

main = do
    binary <- BS.readFile "test.tga"
    let (pixelMap, binary', nil) = runGetState (readPixelMap 2048 2048) binary 0

I've been working under the impression that a 2048 x 2048 .TGA image should load into a few hundred megabytes (at most) of memory. Is there an obvious way to improve my garbage collection times / allocation amount?

Upvotes: 7

Views: 261

Answers (2)

user2407038
user2407038

Reputation: 14578

  1. Using a tuple of Integer to store colour data (which is always 32 bits of data) doesn't make sense. Why not use any of the following variants: (Word8, Word8, Word8, Word8); data Colour = C Word8 Word8 Word8 Word; data Colour = C Word32

  2. A lazy ByteString is just a list of strict ByteString. So if you are reading the data byte by byte, you will essentially produce a list of strict ByteString whose length is 2048*2048. Not very efficient.

  3. The same can be said of using [[a]] to store a two dimensional array. In terms of memory performance, this is just about the worst datatype you could use. Try any one of these: Array (Int, Int) Color; UArray (Int, Int) Color; (Int, Int, Ptr Word8). My preference is the arrays from Data.Array.Repa which are high performance, have many built-in traversals, folds, etc., and naturally support parallelization. Another big advantage is Data.Array.Repa.Repr.ByteString.fromByteString will convert a strict ByteString into an array for further manipulation. The best part about this is you no longer have to worry about performance when reading the array; it has been done for you!

An example:

import Data.Array.Repa.Repr.ByteString
import Data.Array.Repa
import qualified Data.ByteString as B
import Data.Word (Word8)

data TGA = TGA (Array B DIM2 Word8)

readTGAFromFile :: FilePath -> (Int, Int) -> IO TGA
readTGAFromFile file (width,height) = do
  dat <- B.readFile file
  return $ TGA $ fromByteString (Z :. height :. width) dat

Upvotes: 2

Carl
Carl

Reputation: 27003

Your RGBColor type uses 5 machine words, 4 of which are pointers to each Integer. Each Integer consists of one machine word for GC/tagging, and either one additional word for the small representation, or the large representation consisting of a pointer to a byte array for the GMP data and a limb count.

Additionally, you're using a list of lists to store the values. Each non-empty list node is a word for GC/tagging, a word pointer to the value, and a word pointer to the tail.

To use less memory, use appropriate data types. Use unpacked values where possible. Use Word8 instead of Integer for 8-bit values. Use arrays, instead of lists. You know, the basics of acting like you care about memory use.

Take a look at http://johantibell.com/files/slides.pdf for some of the basics.

Upvotes: 11

Related Questions