Reputation: 45737
I'm writing some simple character-counting routines in Haskell, storing the stats in a new datatype:
data Stat = Stat {
stChars :: !Int,
stVowels :: !Int,
stPairEL :: !Int,
stWords :: !Int
}
I'm running this over hundreds or thousands of plain-text files, each about 50K--100K.
tabulateFile :: FilePath -> IO Stat
tabulateFile path = do
putStrLn path
contents <- L.readFile path
return $! tabulateText ' ' contents defaultStat
Instead of using fold-left, I'm using primitive recursion so I can keep around the previous character.
tabulateText :: Char -> L.ByteString -> Stat -> Stat
tabulateText lastChr bs stat =
case U.uncons bs of
Nothing -> stat
Just (chr, newBs) ->
tabulateText lchr newBs (countChar lastChr lchr stat)
where lchr = toLower chr
{-# INLINE countChar #-}
countChar :: Char -> Char -> Stat -> Stat
countChar !lastChr !chr !(Stat stChars stVowels stPairEL stWords) =
Stat
(stChars + 1)
(stVowels + (countIf $ isVowel chr))
(stPairEL + (countIf (lastChr == 'e' && chr == 'l')))
(stWords + (countIf ((not $ isLetter lastChr) && isLetter chr)))
isVowel :: Char -> Bool
isVowel c = Set.member c vowels
vowels = Set.fromAscList ['a', 'e', 'i', 'o', 'u', ...] -- rest of vowels elided
Right now, it's more than twice as slow as running cat * | wc
, but my instinct tells me that the file I/O should outweigh the CPU time needed by a good margin. Simply using cat * | wc
processes about 20MB/s with a hot cache, but using my Haskell program (compiled with -O
) runs at less than 10MB/s, even after some basic optimization. Profiling tells me that most of the time is spent in tabulateText
and countChar
.
Is there something I'm missing that I could optimize here?
Edit: Complete file pasted to http://hpaste.org/74638
Upvotes: 2
Views: 285
Reputation: 45737
After testing other alternatives, it seems that the high CPU usage was mostly because I was using Data.ByteString.Lazy.UTF8. I shaved a negligible bit of runtime off by modifying the data structures in tabulateText
to use foldl
over the UTF8 ByteString.
Given that, I parallelized the program over the files, and was able to sometimes get 7x speedup on my machine.
I first wrapped tabulateFile
with an unsafePerformIO
:
unsafeTabulateFile :: FilePath -> Stat
unsafeTabulateFile f =
unsafePerformIO $ tabulateFile f
and then used Control.Parallel.Strategies
to do parMap rseq unsafeTabulateFile files
.
Upvotes: 0
Reputation: 11218
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as U
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Char
import Control.Applicative
data Stat = Stat {
stChars :: !Int,
stVowels :: !Int,
stPairEL :: !Int,
stWords :: !Int
} deriving Show
defaultStat = Stat 0 0 0 0
{-# INLINE tabulateFile #-}
tabulateFile :: FilePath -> IO Stat
tabulateFile path = newTabulate <$> L.readFile path
{-# INLINE newTabulate #-}
newTabulate :: L.ByteString -> Stat
newTabulate = snd . U.foldl' countIt (' ',defaultStat)
where
{-# INLINE countIt #-}
countIt :: (Char,Stat) -> Char -> (Char,Stat)
countIt (!lastChr,!Stat stChars stVowels stPairEL stWords) !chr =
(chr,Stat
(stChars + 1)
(if isVowel chr then stVowels + 1 else stVowels)
(if (lastChr == 'e' && chr == 'l') then stPairEL + 1 else stPairEL)
(if ((isSpace lastChr) && isLetter chr) then stWords+1 else stWords))
{-# INLINE isVowel #-}
isVowel :: Char -> Bool
isVowel c =
c == 'e' ||
c == 'a' ||
c == 'i' ||
c == 'o' ||
c == 'u'
main:: IO ()
main = do
stat <- tabulateFile "./test.txt"
print stat
Most of the optimizations as suggested by Don is included along with using efficient foldl'. The performance is slightly slower than cat + wc, but it is ok as you are doing some more computation. I have not tested it on very big file but it should work comparable to cat + wc.
Compile with -O2 -funbox-strict-fields
to get optimized code.
I will change it more after looking into the core and see if more optimizations are possible.
One possible point of optimization is having if conditions out of the constructor while constructing stat, for example if chr
is a vowel then it is already a letter
so you don't need the other if in stWords etc, but that will really blow up your code, but you can try and see if it really helps on large files.
Upvotes: 4
Reputation: 137987
You should provide the imports so someone can compile the code. However, there's several things here that look likely:
-O2 -funbox-strict-fields
(to get the benefit of strict fields)lastChr
, and stat
E.g.
isSpaceChar8 :: Char -> Bool
isSpaceChar8 c =
c == ' ' ||
c == '\t' ||
c == '\n' ||
c == '\r' ||
c == '\f' ||
c == '\v' ||
c == '\xa0'
which will inline and optimize very well.
Not sure what countIf
does, but it loosk bad. I suspect its an if
and you return 0?
How about:
Stat
(a + 1)
(if isVowel c then a + 1 else a)
...
Then look at the core.
Upvotes: 10