abden003

Reputation: 1335

Increasing performance in file manipulation

I have a file which contains a matrix of numbers as following:

0 10 24 10 13 4 101 ...
6 0 52 10 4 5 0 4 ...
3 4 0 86 29 20 77 294 ...
4 1 1 0 78 100 83 199 ...
5 4 9 10 0 58 8 19 ...
6 58 60 13 68 0 148 41 ...
. .
.   .
.     .

What I am trying to do is sum each row and output the sum of each row to a new file (with the sum of each row on a new line).

I have tried doing it in Haskell using ByteStrings, but the performance is 3 times as slow as the Python implementation. Here is the Haskell implementation:

import qualified Data.ByteString.Char8 as B

-- This function is for summing a row
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- This function is for mapping the sumrows function to each line
sumfile f = map (\x -> (show x) ++ "\n") (map sumrows (B.split '\n' f)) 

main = do
  contents <- B.readFile "telematrix"
  -- I get the sum of each line, and then pack up all the results so that it can be written
  B.writeFile "teleDensity" $ (B.pack . unwords) (sumfile contents)
  print "complete"

This takes about 14 seconds for a 25 MB file.

Here is the Python implementation:

fd = open("telematrix", "r")
nfd = open("teleDensity", "w")

for line in fd: 
  nfd.write(str(sum(map(int, line.split(" ")))) + "\n")

fd.close()
nfd.close()

This takes about 5 seconds for the same 25 MB file.

Any suggestions on how to improve the performance of the Haskell implementation?

Upvotes: 1

Views: 120

Answers (3)

abden003

Reputation: 1335

The main reason for the poor performance was that I was using runhaskell instead of first compiling and then running the program. So I switched from:

runhaskell program.hs

to

ghc program.hs

./program

Upvotes: 0

abden003

Reputation: 1335

It seems that the problem was that I was compiling and running the program with runhaskell, as opposed to compiling with ghc and then running the resulting binary. By compiling first and then running, I brought the Haskell run time down to 1 second.

Upvotes: 1

Silvio Mayolo

Reputation: 70377

At a glance, I would bet your first bottleneck is the ++ on strings in sumfile, which has to rebuild its left operand each time. Instead of appending "\n" to each line yourself, you could replace the unwords call with unlines, which does exactly what you want here. That should get you a nice little speed boost.
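For instance, a minimal sketch of that change (assuming sumrows and the rest of the program stay as posted) would be:

-- Drop the per-line "\n" and let unlines insert the newlines instead
sumfile f = map show (map sumrows (B.split '\n' f))

main = do
  contents <- B.readFile "telematrix"
  B.writeFile "teleDensity" $ (B.pack . unlines) (sumfile contents)
  print "complete"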

A more minor nitpick is that the (*1) in the maybe function is unneeded. Using id there would be more efficient, since (*1) wastes a multiplication operation, but that's no more than a few processor cycles.
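As a sketch, that line of sumrows would then read (everything else unchanged):

-- id instead of (*1): same result, without the spurious multiply
sumrows r = foldr (\x y -> (maybe 0 id $ fst <$> B.readInt x) + y) 0 (B.split ' ' r)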

Then finally, I have to ask why you're using ByteStrings here. ByteStrings store string data efficiently as an array, like traditional strings in a more imperative language. However, what you're doing here involves splitting the string and iterating over the elements, which are operations that linked lists are well suited for. I would honestly recommend using the traditional [Char] type in this case. That B.split call may be what's ruining you, since it has to take the entire line and copy it into separate arrays of the split form, whereas the words function for linked lists of characters simply splits the linked structure off at a few points.
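A minimal sketch of that String-based version (same file names as the original; read does the conversion and performs no error handling on malformed input):

-- Plain String version: lines/words walk the lazy list rather than copying arrays
main :: IO ()
main = do
  contents <- readFile "telematrix"
  let sums = map (sum . map read . words) (lines contents) :: [Int]
  writeFile "teleDensity" (unlines (map show sums))
  print "complete"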

Upvotes: 0
