erjiang

Reputation: 45737

Optimizing Haskell text processing

I'm writing some simple character-counting routines in Haskell, storing the stats in a new datatype:

data Stat = Stat {
    stChars    :: !Int,
    stVowels   :: !Int,
    stPairEL   :: !Int,
    stWords    :: !Int
}

I'm running this over hundreds or thousands of plain-text files, each about 50K--100K.

tabulateFile :: FilePath -> IO Stat
tabulateFile path = do
  putStrLn path
  contents <- L.readFile path
  return $! tabulateText ' ' contents defaultStat

Instead of using fold-left, I'm using primitive recursion so I can keep around the previous character.

tabulateText :: Char -> L.ByteString -> Stat -> Stat
tabulateText lastChr bs stat =
  case U.uncons bs of
    Nothing -> stat
    Just (chr, newBs) ->
      tabulateText lchr newBs (countChar lastChr lchr stat)
        where lchr = toLower chr

{-# INLINE countChar #-}
countChar :: Char -> Char -> Stat -> Stat
countChar !lastChr !chr !(Stat stChars stVowels stPairEL stWords) =
  Stat
    (stChars  + 1)
    (stVowels + (countIf $ isVowel chr))
    (stPairEL + (countIf (lastChr == 'e' && chr == 'l')))
    (stWords  + (countIf ((not $ isLetter lastChr) && isLetter chr)))

isVowel :: Char -> Bool
isVowel c = Set.member c vowels

vowels = Set.fromAscList ['a', 'e', 'i', 'o', 'u', ...] -- rest of vowels elided

Right now, it's more than twice as slow as running cat * | wc, even though my instinct says the file I/O should outweigh the CPU time by a good margin. cat * | wc processes about 20 MB/s with a hot cache, while my Haskell program (compiled with -O) manages less than 10 MB/s, even after some basic optimization. Profiling tells me that most of the time is spent in tabulateText and countChar.

Is there something I'm missing that I could optimize here?

Edit: Complete file pasted to http://hpaste.org/74638

Upvotes: 2

Views: 285

Answers (3)

erjiang

Reputation: 45737

After testing other alternatives, it seems that the high CPU usage was mostly due to using Data.ByteString.Lazy.UTF8. I shaved off only a negligible bit of runtime by rewriting tabulateText to use foldl over the UTF-8 ByteString.
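
For reference, that fold might have looked roughly like this (a reconstruction, not the exact code; it assumes Data.ByteString.Lazy.UTF8 from the utf8-string package, the BangPatterns extension, and reuses Stat, defaultStat and countChar from the question):

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as U   -- from the utf8-string package
import Data.Char (toLower)

-- Fold over the decoded characters, threading the previous (lowercased)
-- character alongside the running Stat.
tabulateText' :: L.ByteString -> Stat
tabulateText' = snd . U.foldl step (' ', defaultStat)
  where
    step (!lastChr, !stat) chr = (lchr, countChar lastChr lchr stat)
      where lchr = toLower chr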

Given that, I parallelized the program across files and was sometimes able to get a 7x speedup on my machine.

I first wrapped tabulateFile with an unsafePerformIO:

unsafeTabulateFile :: FilePath -> Stat
unsafeTabulateFile f =
  unsafePerformIO $ tabulateFile f

and then used Control.Parallel.Strategies to do parMap rseq unsafeTabulateFile files.
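
For reference, the driver can be wired up roughly like this (a sketch, not the original code; main, getArgs, and a Show instance on Stat are assumptions):

import Control.Parallel.Strategies (parMap, rseq)
import System.Environment (getArgs)

main :: IO ()
main = do
  files <- getArgs
  -- rseq only forces each Stat to WHNF, which is enough here because all of
  -- Stat's fields are strict.
  let stats = parMap rseq unsafeTabulateFile files
  mapM_ print stats

The program also needs to be built with -threaded and run with +RTS -N so the sparks actually run on multiple cores.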

Upvotes: 0

Satvik

Reputation: 11218

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as U
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Char
import Control.Applicative 

data Stat = Stat {
    stChars    :: !Int,
    stVowels   :: !Int,
    stPairEL   :: !Int,
    stWords    :: !Int
} deriving Show
defaultStat = Stat 0 0 0 0

{-# INLINE  tabulateFile #-}
tabulateFile :: FilePath -> IO Stat
tabulateFile path = newTabulate <$> L.readFile path

{-#  INLINE newTabulate #-}
newTabulate :: L.ByteString -> Stat 
newTabulate = snd . U.foldl' countIt (' ',defaultStat) 
    where 
        {-#  INLINE countIt #-}
        countIt :: (Char,Stat) -> Char -> (Char,Stat)
        countIt (!lastChr,!Stat stChars stVowels stPairEL stWords) !chr = 
            (chr,Stat
                (stChars  + 1)
                (if isVowel chr then stVowels + 1 else stVowels)
                (if (lastChr == 'e' && chr == 'l') then stPairEL + 1 else stPairEL)
                (if ((isSpace lastChr) && isLetter chr) then stWords+1 else stWords))

{-# INLINE isVowel #-}
isVowel :: Char -> Bool
isVowel c = 
    c == 'e' ||
    c == 'a' ||
    c == 'i' ||
    c == 'o' ||
    c == 'u' 



main:: IO ()
main = do 
    stat <- tabulateFile "./test.txt"
    print stat

Most of the optimizations suggested by Don are included, along with the more efficient foldl'. The performance is slightly slower than cat + wc, but that is fine since you are doing more computation. I have not tested it on very large files, but it should perform comparably to cat + wc.

Compile with -O2 -funbox-strict-fields to get optimized code.

I will update this after looking at the Core to see whether more optimizations are possible. One candidate is hoisting the if conditions out of the constructor when building the Stat: for example, if chr is a vowel it is already a letter, so the isLetter test in the stWords field would be redundant in that branch. That will blow up the code, but you can try it and see whether it actually helps on large files.
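
A rough sketch of that idea (a variant of countIt above, lifted to the top level; it reuses the Stat, isVowel and Data.Char definitions from the code above and is not benchmarked):

-- Classify chr once with a guard: if chr is not a letter, the vowel,
-- 'el'-pair and word-start counters cannot change, so only the character
-- count ticks.
countIt :: (Char, Stat) -> Char -> (Char, Stat)
countIt (!lastChr, !Stat chars vowels pairEL words') !chr
    | isLetter chr =
        ( chr
        , Stat (chars + 1)
               (if isVowel chr then vowels + 1 else vowels)
               (if lastChr == 'e' && chr == 'l' then pairEL + 1 else pairEL)
               (if isSpace lastChr then words' + 1 else words') )
    | otherwise = (chr, Stat (chars + 1) vowels pairEL words')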

Upvotes: 4

Don Stewart

Reputation: 137987

You should provide the imports so someone can compile the code. That said, several things here look like likely culprits:

  • Compile with -O2 -funbox-strict-fields (to get the benefit of strict fields)
  • tabulateText should be strict in lastChr and stat (see the strict-loop sketch after the example below)
  • Set.member seems like a very expensive way to do equality comparisons. Use a jump table.

E.g.

isSpaceChar8 :: Char -> Bool
isSpaceChar8 c =
    c == ' '     ||
    c == '\t'    ||
    c == '\n'    ||
    c == '\r'    ||
    c == '\f'    ||
    c == '\v'    ||
    c == '\xa0'

which will inline and optimize very well.
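
On the second bullet, a minimal sketch of the strict loop (an illustration, not tested code; it assumes the BangPatterns extension and reuses the question's countChar and imports):

-- Force lastChr and stat on every iteration so no thunks build up across
-- the recursion.
tabulateText :: Char -> L.ByteString -> Stat -> Stat
tabulateText !lastChr bs !stat =
  case U.uncons bs of
    Nothing           -> stat
    Just (chr, newBs) ->
      let !lchr = toLower chr
      in tabulateText lchr newBs (countChar lastChr lchr stat)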

Not sure what countIf does, but it looks bad. I suspect it's an if that returns 1 or 0? How about:

Stat
   (a + 1)
   (if isVowel c then b + 1 else b)
   ...

Then look at the core.

Upvotes: 10
