ruben.moor

Reputation: 1965

Efficiently read and sort a file containing lines of text in Haskell

I happened to sort a list of German words by their natural frequency¹. I am not happy with the memory performance of my algorithm.

[Heap profile of V1, V2, and V3]

The graphic is created with hp/D3.js. It shows the runtime heap for V1, V2, and V3, as given in the code below.

I uploaded the complete code, including short instructions on how to run it with profiling (via stack and nix), to GitHub here. It is also pasted in full below.

Version 1 reads both large files using strict IO from Data.Text.IO. The difference to Versions 2 and 3, which use lazy IO from Data.Text.Lazy.IO, shows up quite nicely: with strict IO the data jumps into existence immediately, whereas Versions 2 and 3 build it up gradually.

Size of the data structures

I can give quite accurate sizes based on these formulae, since I know what is in the files; the German words there average about 16 characters in length. These numbers are not read off the profiling output but calculated independently.
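
The 16-character average itself can be checked with a throwaway helper along these lines (not part of the repository, just a sketch using the same lazy line-by-line reading as below):

{-# LANGUAGE BangPatterns #-}

-- wordstats.hs (hypothetical helper): count the lines of the dictionary file
-- and compute their average length
import Data.List (foldl')
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy

main :: IO ()
main = do
    ls <- Lazy.lines <$> Lazy.readFile "german.utf8.dic"
    -- single strict fold, so the lazy line list can be collected as we go
    let (n, len) = foldl' (\(!c, !s) l -> (c + 1, s + Lazy.length l)) (0 :: Int, 0) ls
    putStrLn $ show n <> " words, average length "
        <> show (fromIntegral len / fromIntegral n :: Double)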

What I don't understand

Apart from that, I am completely lost. I am trying to understand the following issues:

  1. Why are my {-# SCC foo #-} annotations ignored? I have no control over the cost centers in the profiling. This happens on both GHC 8.8.4 and GHC 9.2.1, with nix/cabal and stack alike. (A minimal standalone check for manual cost centers is sketched right after this list.)

  2. The profile suggests a peak memory usage of little more than 1 GB. However, running top, I can see that the program actually uses up to 2.6 GB, nearly double that amount. Shouldn't those two amounts be equal?

  3. Where is the garbage collection happening? My suspicion is that there is none. Versions 2 and 3 show some garbage collection, but only to the extent that their memory use exceeded Version 1's.

  4. Can I expect a much leaner memory profile at all, given my choice of hashmap, list, and vector? The hashmap and the vector alone would add up to 717 MB, less than half of what I see in top. How do I get there?

  5. Are there other, preferable data structures for this kind of task? I chose vector for the sorting algorithm. I can't move to Storable, Unboxed, or Primitive vectors because of Text (at least I don't know how).

  6. The summary of the runtime statistics (see below) says "Productivity 43.5%". My guess is that the profiling itself accounts for part of that, but, judging from the numbers, could there also be excessive garbage-collector activity?
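
For reference, this is the minimal, self-contained check I would use to see whether a manual cost center shows up at all. The file name and the plain ghc invocation are my own assumptions, not the stack/nix setup from the repository:

-- SccCheck.hs (hypothetical standalone file)
-- compile and run with:
--   ghc -prof -fno-prof-auto -rtsopts SccCheck.hs
--   ./SccCheck +RTS -p -RTS
-- the cost center "expensive" should then show up in SccCheck.prof
module Main where

main :: IO ()
main = print ({-# SCC expensive #-} sum [1 .. 10000000 :: Int])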

-- app/Main.hs
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Category ((<<<))
import Control.Monad.ST (runST)
import Data.Functor ((<&>))
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.Maybe (catMaybes, fromMaybe)
import Data.Ord (Down (Down), comparing)
import Data.Text (Text)
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy
import Data.Vector (Vector, freeze, thaw)
import qualified Data.Vector as Vector
import qualified Data.Vector.Algorithms.Tim as Tim
import System.IO (hFlush, stdout)
import GHC.Conc (pseq)

main :: IO ()
main = do
    putStr ""

    putStr "Running v1 ..."
    hFlush stdout
    u1 <- runV1
    putStrLn $ u1 `seq` " done."

    putStrLn ""
    putStr "Running v2 ..."
    hFlush stdout
    u2 <- runV2
    putStrLn $ u2 `seq` " done."

    putStrLn ""
    putStr "Running v3 ..."
    hFlush stdout
    u3 <- runV3
    putStrLn $ u3 `seq` " done."

fileFrequencies :: FilePath
fileFrequencies = "deu_news_2020_freq.txt"

fileData :: FilePath
fileData = "german.utf8.dic"

fileSorted :: FilePath
fileSorted = "german.utf8.sorted.dic"

{- |
straightforward implementation, using Text-based IO
-}
runV1 :: IO ()
runV1 = do
    mapFrequencies <- readFrequencies
    ls <- Text.lines <$> Text.readFile fileData
    let sorted = quicksort mapFrequencies $ {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
    Text.writeFile fileSorted $ Text.unlines $ {-# SCC lsSorted #-} Vector.toList ({-# SCC sorted #-} sorted)
  where
    {-# SCC readFrequencies #-}
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
        ls <- Text.lines <$> Text.readFile fileFrequencies
        pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)

{- |
why not Lazy? read the file line by line, no need to hold it all in memory
-}
runV2 :: IO ()
runV2 = do
    mapFrequencies <- readFrequencies
    ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
    let sorted = quicksort mapFrequencies $ {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
    Text.writeFile fileSorted $ Text.unlines $ {-# SCC lsSorted #-} Vector.toList ({-# SCC sorted #-} sorted)
  where
    {-# SCC readFrequencies #-}
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
        ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
        pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)

{-|
trying to help with garbage collection, only making it worse
-}
runV3 :: IO ()
runV3 = do
    mapFrequencies <- readFrequencies
    ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData

    let -- alternatives:
        --     Vector.fromListN (length ls) ls
        --     Vector.generate (length ls) $ \i -> ls !! i
        vec = {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)

        -- the idea: ls can get garbage-collected ...
        sorted = vec `seq` {-# SCC sorted #-} quicksort mapFrequencies vec

    -- ... before we sort and write to the file
    sorted `pseq` Lazy.writeFile fileSorted (Lazy.unlines $ Lazy.fromStrict <$> {-# SCC lsSorted #-} Vector.toList sorted)
  where
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
        ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
        pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)

freq :: HashMap Text Int -> Text -> Int
freq m w = fromMaybe 0 $ HashMap.lookup w m

quicksort ::
    HashMap Text Int -> Vector Text -> Vector Text
quicksort freqs vec = runST $ do
    mvec <- thaw vec
    Tim.sortBy (comparing $ Down <<< freq freqs) mvec
    freeze mvec

mkHashMap :: [Text] -> HashMap Text Int
mkHashMap ls =
    HashMap.fromList $
        catMaybes $
            ls <&> \l -> case Text.head l of
                -- skip comment lines starting with '#'
                '#' -> Nothing
                _ ->
                    -- every other line has the shape "word<TAB>frequency"
                    let [w, f] = Text.splitOn "\t" l
                     in Just (w, read $ Text.unpack f)

Runtime statistics summary (+RTS -s)

 343,377,611,904 bytes allocated in the heap
1,345,257,485,736 bytes copied during GC
   1,489,914,240 bytes maximum residency (1608 sample(s))
     203,039,648 bytes maximum slop
            2829 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     328286 colls,     0 par   12.067s  12.117s     0.0000s    0.0114s
  Gen  1      1608 colls,     0 par   1001.504s  1001.547s     0.6229s    1.0471s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time  160.134s  (160.481s elapsed)
  GC      time  663.692s  (663.771s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time  349.879s  (349.893s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time  1173.705s  (1174.145s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,144,311,061 bytes per MUT second

  Productivity  43.5% of total user, 43.5% of total elapsed


¹ The word frequency information was provided to me by the Natural Language Processing Group, Uni Leipzig. It is generated from a corpus of 35 million sentences and distributed under the Creative Commons Attribution-NonCommercial 4.0 International Public License.

Upvotes: 5

Views: 282

Answers (1)

ruben.moor

Reputation: 1965

EDIT: I am adding the new results on top. Below you can still see the less interesting results of my earlier optimizations.

[Heap profile: the optimized code next to the original for comparison]

That short peak on the right is the optimized code. The big peak on the left is there for comparison.

This is the code (provided by @bodigrim on discourse.haskell.org):

{-# LANGUAGE OverloadedStrings #-}

module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Char8 as BS (lines, readInt)
import Data.List (sortOn)
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
    mapFrequencies <- Map.fromList . parseFrequencies <$> BS.readFile fileFrequencies
    ls <- BS.lines <$> BS.readFile fileData
    let sorted = sortOn (\k -> Map.findWithDefault 0 k mapFrequencies) ls
    BSB.writeFile fileSorted $ foldMap ((<> "\n") . BSB.byteString) sorted

fileFrequencies :: FilePath
fileFrequencies = "deu_news_2020_freq.txt"

fileData :: FilePath
fileData = "german.utf8.dic"

fileSorted :: FilePath
fileSorted = "german.utf8.sorted.dic"

parseFrequencies :: BS.ByteString -> [(BS.ByteString, Int)]
parseFrequencies bs = case BS.uncons bs of
    Nothing -> []
    -- this is admittedly brittle, just to demonstrate single-pass parsing with readInt
    -- 35 is '#': skip comment lines by dropping everything up to the next newline (10)
    Just (35, _) -> parseFrequencies (BS.unsafeTail (BS.dropWhile (/= 10) bs))
    -- otherwise split the line at the tab (9) into word and frequency
    _ -> let (w, f) = BS.break (== 9) bs in
         case BS.readInt (BS.unsafeTail f) of
                Just (i, bs') -> (w, i) : parseFrequencies (BS.unsafeTail bs')
                Nothing -> []
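
One difference from my original version: my quicksort sorted by descending frequency (via Down), while sortOn here sorts ascending. The descending variant is just as short; a sketch (the function name is mine):

import Data.List (sortOn)
import Data.Ord (Down (Down))
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as Map

-- most frequent words first, as in my original quicksort-based version
sortByFreqDesc :: Map.Map BS.ByteString Int -> [BS.ByteString] -> [BS.ByteString]
sortByFreqDesc freqs = sortOn (\k -> Down (Map.findWithDefault 0 k freqs))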

Take-aways

  • custom bytestring parsing pays off hugely performance-wise
  • Data.List.sort is amazing, allowing early garbage collection in the midst of sorting
  • in my case, any runtime performance gain from HashMap isn't worth the extra memory, so Map is fine, even though my lookups involve bytestring comparisons

My old results

I achieved some sort of insight and a bit of optimization (new code here):

[Heap profiles of Versions 1, 2, and 3, broken down by type (+RTS -hT)]

The leftmost graph is the heap profile of Version 1, using strict IO via Data.Text.IO. It is the same heap as above; I only switched to analysis by type (via +RTS -hT) because my manual cost centers don't work:

ls <- Text.lines <$> Text.readFile fileData
-- ls then converted into vector

ls <- Text.lines <$> Text.readFile fileFrequencies
-- ls then converted into a strict hashmap

The middle graph is what I get for switching to lazy IO:

ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
-- ls then converted into vector

ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
-- ls then converted into a strict hashmap

There is some gain in "ARR_WORDS" but apart from that things got worse. I achieve the cleanest result in Version 3 using:

ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile fileData
-- ls then converted into vector

ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
-- ls then converted into a strict hashmap

Tweaking strict/lazy evaluation

In conclusion, for reading a large file into my strict hashmap

ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile file

seems to be the way to go. For reading a large file and converting the data into a vector via Vector.fromList

ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file

seems to be necessary. That latter line doesn't have any disadvantage over the former (as far as I can see) and might become my standard way of reading text files line by line.
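
For reference, here are both patterns as one self-contained module (the module name, the function names, and the Control.Monad import for <$!> are my additions; the right-hand sides are exactly the lines above):

module ReadLines where

import Control.Monad ((<$!>))
import Data.Text (Text)
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy

-- lazy read; each line is converted to strict Text only when it is demanded
readForHashMap :: FilePath -> IO [Text]
readForHashMap file = fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile file

-- same, but <$!> forces the conversion of each line as soon as the list
-- element is consumed, which is what Vector.fromList needed
readForVector :: FilePath -> IO [Text]
readForVector file = (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file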

Tweaking the garbage collector

I learned that GHC's garbage collector uses twice the live memory. So given the heap profile, a memory use of 2.4 GB is to be expected.

I was able to optimize this via +RTS -A, i.e. by setting the allocation area size, which is 1 MB by default:

/usr/bin/env time -f '%M' cabal run readFilePerformance -- +RTS -s -A64M

45,138,177,376 bytes allocated in the heap
   1,917,960,640 bytes copied during GC
     569,104,376 bytes maximum residency (9 sample(s))
     136,677,384 bytes maximum slop
            1494 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       665 colls,     0 par    3.446s   3.446s     0.0052s    0.0248s
  Gen  1         9 colls,     0 par    1.186s   1.186s     0.1318s    0.6386s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   19.486s  ( 19.591s elapsed)
  GC      time    4.632s  (  4.632s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   24.118s  ( 24.224s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,316,458,656 bytes per MUT second

  Productivity  80.8% of total user, 80.9% of total elapsed

1532300

Compare that to the default:

 /usr/bin/env time -f '%M' cabal run readFilePerformance -- +RTS -s

  45,188,257,608 bytes allocated in the heap
   2,943,601,200 bytes copied during GC
     836,171,600 bytes maximum residency (14 sample(s))
     197,200,048 bytes maximum slop
            1889 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     43127 colls,     0 par    3.157s   3.163s     0.0001s    0.0117s
  Gen  1        14 colls,     0 par    1.954s   1.954s     0.1396s    1.0358s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   17.347s  ( 17.465s elapsed)
  GC      time    5.111s  (  5.117s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   22.458s  ( 22.582s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,604,931,438 bytes per MUT second

  Productivity  77.2% of total user, 77.3% of total elapsed

1937440

Increasing the allocation area size to 64 MB, I go down from 1.9 GB to about 1.5 GB of RAM use at the cost of a slightly increased runtime. Increasing it beyond that did not yield significantly better results.
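
To avoid passing the flag on every run, the setting can also be baked into the binary via GHC's -with-rtsopts option, e.g. in the cabal file (a sketch, to be merged into whatever the executable stanza already contains):

ghc-options: -rtsopts -with-rtsopts=-A64M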

Switching to the compacting garbage collection algorithm (via +RTS -c) didn't improve the memory footprint.


Upvotes: 3
