Reputation: 1965
I happened to sort a list of German words by their natural frequency1. I am not happy with the memory performance of my algorithm.
The graphic is created with hp/D3.js. It shows the runtime heap for V1, V2, and V3, as given in the code below.
I uploaded the complete code, including short instructions on how to run it with profiling (via stack and nix), to GitHub here. It is also pasted in full below.
Version 1 reads both large files using strict IO from Data.Text.IO. The difference from Versions 2 and 3, which use lazy IO from Data.Text.Lazy.IO, shows quite nicely: with strict IO, the data jumps into existence immediately, whereas Versions 2 and 3 build it up gradually.
I can give quite accurate sizes based on these formulae, and I know what's in the files; the German words there average about 16 characters in length. These numbers are not interpreted from the profiling output but calculated independently (a rough back-of-the-envelope sketch follows the list below):
- HashMap Text Int
- [Text]
- Vector Text
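As a back-of-the-envelope sketch of the kind of estimate I mean (the per-object overheads here are rough assumptions for a 64-bit machine and a text-1.x Text, which stores UTF-16 code units plus a few words of headers; the word count is just a placeholder, and the HashMap overhead is harder to pin down, so I leave it out):

-- rough size estimate; the overhead constants are assumptions, n is a placeholder
module Main where

-- rough bytes for one Text of the given average length:
-- ~6 machine words of constructor and array headers + 2 bytes per UTF-16 code unit
bytesPerText :: Double -> Double
bytesPerText avgLen = 6 * 8 + 2 * avgLen

-- rough bytes for a list and a boxed vector of n such Texts
listBytes, vectorBytes :: Double -> Double -> Double
listBytes   n avgLen = n * (3 * 8 + bytesPerText avgLen)  -- each (:) cell is ~3 words
vectorBytes n avgLen = n * 8 + n * bytesPerText avgLen    -- one pointer per slot + payload

main :: IO ()
main = do
  let n = 1e6  -- placeholder word count, not the real size of my files
  putStrLn $ "[Text]      ~ " <> show (listBytes   n 16 / 1e6) <> " MB"
  putStrLn $ "Vector Text ~ " <> show (vectorBytes n 16 / 1e6) <> " MB"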
Apart from that, I am completely lost. I am trying to understand these issues:
Why are my {-# SCC foo #-} annotations ignored? I don't have control over the cost centers in the profiling. This happens on both GHC 8.8.4 and GHC 9.2.1, with nix/cabal and stack alike.
The profile suggests a peak memory usage of little more than 1 GB. However, running top, I can see that the program actually uses up to 2.6 GB, nearly double that amount. Shouldn't those amounts be equal?
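As a quick cross-check of those two numbers from inside the program, the RTS statistics can be queried directly (a minimal sketch, not part of the code below; it has to be run with +RTS -T or -s so that statistics are collected):

module Main where

import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "Run with +RTS -T to enable RTS statistics."
    else do
      stats <- getRTSStats
      -- max_live_bytes is roughly what the heap profile shows,
      -- max_mem_in_use_bytes is closer to what top reports
      putStrLn $ "max live bytes:    " <> show (max_live_bytes stats)
      putStrLn $ "max memory in use: " <> show (max_mem_in_use_bytes stats)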
Where is the garbage collection happening? My suspicion is that there is none. Versions 2 and 3 show some garbage collection, but only to the degree that their memory use exceeded Version 1's.
Can I expect a much leaner memory profile at all, given my choice of hashmap, list, and vector? Just adding up the hashmap and the vector would amount to 717 MB, less than half of what I see in top. How do I get there?
Are there other, preferable data structures or tools for this kind of task? I chose vector for the sorting algorithm. I can't move to Storable, Unboxed, or Primitive vectors because of Text (at least I don't know how).
In the summary of the runtime statistics (see below), it says "Productivity 43.5%". My guess would be that the profiling itself is part of the cause. But based on the numbers, could there also be excessive garbage-collector activity?
-- app/Main.hs
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Category ((<<<))
import Control.Monad.ST (runST)
import Data.Functor ((<&>))
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.Maybe (catMaybes, fromMaybe)
import Data.Ord (Down (Down), comparing)
import Data.Text (Text)
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy
import Data.Vector (Vector, freeze, thaw)
import qualified Data.Vector as Vector
import qualified Data.Vector.Algorithms.Tim as Tim
import System.IO (hFlush, stdout)
import GHC.Conc (pseq)
main :: IO ()
main = do
  putStr ""
  putStr "Running v1 ..."
  hFlush stdout
  u1 <- runV1
  putStrLn $ u1 `seq` " done."
  putStrLn ""
  putStr "Running v2 ..."
  hFlush stdout
  u2 <- runV2
  putStrLn $ u2 `seq` " done."
  putStrLn ""
  putStr "Running v3 ..."
  hFlush stdout
  u3 <- runV3
  putStrLn $ u3 `seq` " done."
fileFrequencies :: FilePath
fileFrequencies = "deu_news_2020_freq.txt"
fileData :: FilePath
fileData = "german.utf8.dic"
fileSorted :: FilePath
fileSorted = "german.utf8.sorted.dic"
{- |
straightforward implementation, using Text-based IO
-}
runV1 :: IO ()
runV1 = do
  mapFrequencies <- readFrequencies
  ls <- Text.lines <$> Text.readFile fileData
  let sorted = quicksort mapFrequencies $ {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
  Text.writeFile fileSorted $ Text.unlines $ {-# SCC lsSorted #-} Vector.toList ({-# SCC sorted #-} sorted)
  where
    {-# SCC readFrequencies #-}
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
      ls <- Text.lines <$> Text.readFile fileFrequencies
      pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)
{- |
why not Lazy? read the file line by line, no need to hold it all in memory
-}
runV2 :: IO ()
runV2 = do
  mapFrequencies <- readFrequencies
  ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
  let sorted = quicksort mapFrequencies $ {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
  Text.writeFile fileSorted $ Text.unlines $ {-# SCC lsSorted #-} Vector.toList ({-# SCC sorted #-} sorted)
  where
    {-# SCC readFrequencies #-}
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
      ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
      pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)
{-|
trying to help with garbage collection, only making it worse
-}
runV3 :: IO ()
runV3 = do
  mapFrequencies <- readFrequencies
  ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
  let -- alternatives:
      -- Vector.fromListN (length ls) ls
      -- Vector.generate (length ls) $ \i -> ls !! i
      vec = {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
      -- the idea: ls can get garbage-collected ...
      sorted = vec `seq` {-# SCC sorted #-} quicksort mapFrequencies vec
  -- ... before we sort and write to the file
  sorted `pseq` Lazy.writeFile fileSorted (Lazy.unlines $ Lazy.fromStrict <$> {-# SCC lsSorted #-} Vector.toList sorted)
  where
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
      ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
      pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)
freq :: HashMap Text Int -> Text -> Int
freq m w = fromMaybe 0 $ HashMap.lookup w m

quicksort :: HashMap Text Int -> Vector Text -> Vector Text
quicksort freqs vec = runST $ do
  mvec <- thaw vec
  Tim.sortBy (comparing $ Down <<< freq freqs) mvec
  freeze mvec

mkHashMap :: [Text] -> HashMap Text Int
mkHashMap ls =
  HashMap.fromList $
    catMaybes $
      ls <&> \l -> case Text.head l of
        '#' -> Nothing
        _ ->
          let [w, f] = Text.splitOn "\t" l
           in Just (w, read $ Text.unpack f)
343,377,611,904 bytes allocated in the heap
1,345,257,485,736 bytes copied during GC
1,489,914,240 bytes maximum residency (1608 sample(s))
203,039,648 bytes maximum slop
2829 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 328286 colls, 0 par 12.067s 12.117s 0.0000s 0.0114s
Gen 1 1608 colls, 0 par 1001.504s 1001.547s 0.6229s 1.0471s
INIT time 0.000s ( 0.000s elapsed)
MUT time 160.134s (160.481s elapsed)
GC time 663.692s (663.771s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 349.879s (349.893s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1173.705s (1174.145s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,144,311,061 bytes per MUT second
Productivity 43.5% of total user, 43.5% of total elapsed
1 The word frequency information has been provided to me by the Natural Language Processing Group, Uni Leipzig. It is generated from a corpus of 35 million sentences and distributed under the Creative Commons Attribution-NonCommercial 4.0 International Public Licence.
Upvotes: 5
Views: 282
Reputation: 1965
EDIT: I am adding new results on top. Below, you can still see the less interesting results of my earlier optimization attempts.
That short peak on the right is the optimized code. The big peak on the left is there for comparison.
This is the code (provided by @bodigrim on discourse.haskell.org):
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Char8 as BS (lines, readInt)
import Data.List (sortOn)
import qualified Data.Map.Strict as Map
main :: IO ()
main = do
  mapFrequencies <- Map.fromList . parseFrequencies <$> BS.readFile fileFrequencies
  ls <- BS.lines <$> BS.readFile fileData
  let sorted = sortOn (\k -> Map.findWithDefault 0 k mapFrequencies) ls
  BSB.writeFile fileSorted $ foldMap ((<> "\n") . BSB.byteString) sorted
fileFrequencies :: FilePath
fileFrequencies = "deu_news_2020_freq.txt"
fileData :: FilePath
fileData = "german.utf8.dic"
fileSorted :: FilePath
fileSorted = "german.utf8.sorted.dic"
parseFrequencies :: BS.ByteString -> [(BS.ByteString, Int)]
parseFrequencies bs = case BS.uncons bs of
  Nothing -> []
  -- this is admittedly brittle, just to demonstrate single-pass parsing with readInt:
  -- 35 is '#' (a comment line), 10 is '\n', 9 is '\t'
  Just (35, _) -> parseFrequencies (BS.unsafeTail (BS.dropWhile (/= 10) bs))
  _ -> let (w, f) = BS.break (== 9) bs in
    case BS.readInt (BS.unsafeTail f) of
      Just (i, bs') -> (w, i) : parseFrequencies (BS.unsafeTail bs')
      Nothing -> []
- Data.List.sort is amazing, allowing early garbage collection in the midst of sorting (a small sketch of this approach, with the descending order from my earlier versions, follows below).
- HashMap isn't worth the extra memory, and thus Map is fine, even though my lookups involve bytestring comparisons.
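A minimal sketch of that approach with the descending frequency order that my Vector-based versions used (the toy data and the function name are just for illustration; this is not part of @bodigrim's code):

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as BS
import Data.List (sortOn)
import qualified Data.Map.Strict as Map
import Data.Ord (Down (Down))

-- sort words by descending frequency; unknown words (frequency 0) go last
sortByFrequencyDesc :: Map.Map BS.ByteString Int -> [BS.ByteString] -> [BS.ByteString]
sortByFrequencyDesc freqs = sortOn (\w -> Down (Map.findWithDefault 0 w freqs))

main :: IO ()
main = do
  let freqs = Map.fromList [("der", 3), ("Haus", 2), ("Quark", 1)]  -- toy data
  mapM_ BS.putStrLn $ sortByFrequencyDesc freqs ["Quark", "der", "Haus"]
  -- prints: der, Haus, Quark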
My old results

I achieved some sort of insight and a bit of optimization (new code here):
The leftmost graph is the heap profile of Version 1, using strict IO via Data.Text.IO. It's the same heap, only I switched to analysis by type (via +RTS -hT) because my manual cost centers don't work:
ls <- Text.lines <$> Text.readFile fileData
-- ls then converted into vector
ls <- Text.lines <$> Text.readFile fileFrequencies
-- ls then converted into a strict hashmap
The middle graph is what I get for switching to lazy IO:
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
-- ls then converted into vector
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
-- ls then converted into a strict hashmap
There is some gain in "ARR_WORDS", but apart from that, things got worse. I achieved the cleanest result in Version 3 using:
ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile fileData
-- ls then converted into vector
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
-- ls then converted into a strict hashmap
In conclusion, for reading a large file into my strict hashmap,
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile file
seems to be the way to go. For reading a large file and converting the data into a vector via Vector.fromList
ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file
seems to be necessary. That latter line doesn't have any disadvantage over the former (as far as I can see) and might become my standard way of reading text files line by line.
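A tiny sketch of how I might wrap that idiom into a helper (the module and function names are just for illustration):

module ReadLinesStrict (readLinesStrict) where

import Control.Monad ((<$!>))
import Data.Text (Text)
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy

-- | Read a file lazily, but convert each line to a strict 'Text' as soon as
-- it is produced, so that the lazy chunks can be garbage-collected early.
readLinesStrict :: FilePath -> IO [Text]
readLinesStrict file = (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file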
I learned that GHC's copying garbage collector uses up to two times the live memory. So given the heap profile, a memory use of about 2.4 GB is to be expected.
I was able to optimize this via +RTS -A, i.e. by setting the allocation area size, which is 1 MB by default:
/usr/bin/env time -f '%M' cabal run readFilePerformance -- +RTS -s -A64M
45,138,177,376 bytes allocated in the heap
1,917,960,640 bytes copied during GC
569,104,376 bytes maximum residency (9 sample(s))
136,677,384 bytes maximum slop
1494 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 665 colls, 0 par 3.446s 3.446s 0.0052s 0.0248s
Gen 1 9 colls, 0 par 1.186s 1.186s 0.1318s 0.6386s
INIT time 0.000s ( 0.000s elapsed)
MUT time 19.486s ( 19.591s elapsed)
GC time 4.632s ( 4.632s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 24.118s ( 24.224s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,316,458,656 bytes per MUT second
Productivity 80.8% of total user, 80.9% of total elapsed
1532300
Compare that to the default:
/usr/bin/env time -f '%M' cabal run readFilePerformance -- +RTS -s
45,188,257,608 bytes allocated in the heap
2,943,601,200 bytes copied during GC
836,171,600 bytes maximum residency (14 sample(s))
197,200,048 bytes maximum slop
1889 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 43127 colls, 0 par 3.157s 3.163s 0.0001s 0.0117s
Gen 1 14 colls, 0 par 1.954s 1.954s 0.1396s 1.0358s
INIT time 0.000s ( 0.000s elapsed)
MUT time 17.347s ( 17.465s elapsed)
GC time 5.111s ( 5.117s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 22.458s ( 22.582s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,604,931,438 bytes per MUT second
Productivity 77.2% of total user, 77.3% of total elapsed
1937440
By increasing the allocation area to 64 MB, I go down from about 1.9 GB to about 1.5 GB of RAM use (the final figures, 1937440 and 1532300, are the maximum resident set sizes in kB reported by time -f '%M'), at the cost of a slightly increased runtime. Increasing the allocation area beyond that did not yield significantly better results.
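Instead of passing the flag on every run, the same default could presumably also be baked into the binary at link time via GHC's -rtsopts and -with-rtsopts=-A64M options (e.g. in the ghc-options field of the cabal file); the measurements above were all taken by passing +RTS -A64M by hand.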
Switching to a compacting algorithm for the garbage collector (via +RTS -c) didn't improve the memory footprint.
Upvotes: 3