Reputation: 10761
I have a little script to read in, parse and derive some kind of interesting (not really) statistics from an apache log file. So far I've made two simple options, the total number of bytes sent in all requests in the log file, and a top 10 of the most common IP adresses.
The first "mode" is just a simple sum of all the parsed bytes. The second one is a fold over a map (Data.Map), using insertWith (+) 1'
to count the occurrences.
The first one runs as I expected, most of the time spent parsing, in constant space.
42,359,709,344 bytes allocated in the heap 72,405,840 bytes copied during GC 113,712 bytes maximum residency (1553 sample(s)) 145,872 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 76311 collections,
0 parallel, 0.89s, 0.99s elapsed
Generation 1: 1553 collections, 0 parallel, 0.21s, 0.22s elapsedINIT time 0.00s ( 0.00s elapsed) MUT time 21.76s ( 24.82s elapsed) GC time 1.10s ( 1.20s elapsed) EXIT time
0.00s ( 0.00s elapsed) Total time 22.87s ( 26.02s elapsed)%GC time 4.8% (4.6% elapsed)
Alloc rate 1,946,258,962 bytes per MUT second
Productivity 95.2% of total user, 83.6% of total elapsed
However, the second one does not!
49,398,834,152 bytes allocated in the heap 580,579,208 bytes copied during GC 718,385,088 bytes maximum residency (15 sample(s)) 134,532,128 bytes maximum slop 1393 MB total memory in use (172 MB lost due to fragmentation)
Generation 0: 91275 collections,
0 parallel, 252.65s, 254.46s elapsed
Generation 1: 15 collections, 0 parallel, 0.12s, 0.12s elapsedINIT time 0.00s ( 0.00s elapsed) MUT time 41.11s ( 48.87s elapsed) GC time 252.77s (254.58s elapsed) EXIT time
0.00s ( 0.01s elapsed) Total time 293.88s (303.45s elapsed)%GC time 86.0% (83.9% elapsed)
Alloc rate 1,201,635,385 bytes per MUT second
Productivity 14.0% of total user, 13.5% of total elapsed
And here is the code.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Attoparsec.Lazy as AL
import Data.Attoparsec.Char8 hiding (space, take)
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Monad (liftM)
import System.Environment (getArgs)
import Prelude hiding (takeWhile)
import qualified Data.Map as M
import Data.List (foldl', sortBy)
import Text.Printf (printf)
import Data.Maybe (fromMaybe)
type Command = String
data LogLine = LogLine {
getIP :: S.ByteString,
getIdent :: S.ByteString,
getUser :: S.ByteString,
getDate :: S.ByteString,
getReq :: S.ByteString,
getStatus :: S.ByteString,
getBytes :: S.ByteString,
getPath :: S.ByteString,
getUA :: S.ByteString
} deriving (Ord, Show, Eq)
quote, lbrack, rbrack, space :: Parser Char
quote = satisfy (== '\"')
lbrack = satisfy (== '[')
rbrack = satisfy (== ']')
space = satisfy (== ' ')
quotedVal :: Parser S.ByteString
quotedVal = do
quote
res <- takeTill (== '\"')
quote
return res
bracketedVal :: Parser S.ByteString
bracketedVal = do
lbrack
res <- takeTill (== ']')
rbrack
return res
val :: Parser S.ByteString
val = takeTill (== ' ')
line :: Parser LogLine
l ine = do
ip <- val
space
identity <- val
space
user <- val
space
date <- bracketedVal
space
req <- quotedVal
space
status <- val
space
bytes <- val
(path,ua) <- option ("","") combined
return $ LogLine ip identity user date req status bytes path ua
combined :: Parser (S.ByteString,S.ByteString)
combined = do
space
path <- quotedVal
space
ua <- quotedVal
return (path,ua)
countBytes :: [L.ByteString] -> Int
countBytes = foldl' count 0
where
count acc l = case AL.maybeResult $ AL.parse line l of
Just x -> (acc +) . maybe 0 fst . S.readInt . getBytes $ x
Nothing -> acc
countIPs :: [L.ByteString] -> M.Map S.ByteString Int
countIPs = foldl' count M.empty
where
count acc l = case AL.maybeResult $ AL.parse line l of
Just x -> M.insertWith' (+) (getIP x) 1 acc
Nothing -> acc
---------------------------------------------------------------------------------
main :: IO ()
main = do
[cmd,path] <- getArgs
dispatch cmd path
pretty :: Show a => Int -> (a, Int) -> String
pretty i (bs, n) = printf "%d: %s, %d" i (show bs) n
dispatch :: Command -> FilePath -> IO ()
dispatch cmd path = action path
where
action = fromMaybe err (lookup cmd actions)
err = printf "Error: %s is not a valid command." cmd
actions :: [(Command, FilePath -> IO ())]
actions = [("bytes", countTotalBytes)
,("ips", topListIP)]
countTotalBytes :: FilePath -> IO ()
countTotalBytes path = print . countBytes . L.lines =<< L.readFile path
topListIP :: FilePath -> IO ()
topListIP path = do
f <- liftM L.lines $ L.readFile path
let mostPopular (_,a) (_,b) = compare b a
m = countIPs f
mapM_ putStrLn . zipWith pretty [1..] . take 10 . sortBy mostPopular . M.toList $ m
Edit:
Adding +RTS -A16M reduced GC to 20%. Memory use of course unchanged.
Upvotes: 2
Views: 553
Reputation: 9009
I suggest making the following changes to the code:
@@ -1,4 +1,4 @@
-{-# LANGUAGE OverloadedStrings #-}
+{-# LANGUAGE BangPatterns, OverloadedStrings #-}
module Main where
@@ -9,7 +9,7 @@
import Control.Monad (liftM)
import System.Environment (getArgs)
import Prelude hiding (takeWhile)
-import qualified Data.Map as M
+import qualified Data.HashMap.Strict as M
import Data.List (foldl', sortBy)
import Text.Printf (printf)
import Data.Maybe (fromMaybe)
@@ -17,15 +17,15 @@
type Command = String
data LogLine = LogLine {
- getIP :: S.ByteString,
- getIdent :: S.ByteString,
- getUser :: S.ByteString,
- getDate :: S.ByteString,
- getReq :: S.ByteString,
- getStatus :: S.ByteString,
- getBytes :: S.ByteString,
- getPath :: S.ByteString,
- getUA :: S.ByteString
+ getIP :: !S.ByteString,
+ getIdent :: !S.ByteString,
+ getUser :: !S.ByteString,
+ getDate :: !S.ByteString,
+ getReq :: !S.ByteString,
+ getStatus :: !S.ByteString,
+ getBytes :: !S.ByteString,
+ getPath :: !S.ByteString,
+ getUA :: !S.ByteString
} deriving (Ord, Show, Eq)
quote, lbrack, rbrack, space :: Parser Char
@@ -39,14 +39,14 @@
quote
res <- takeTill (== '\"')
quote
- return res
+ return $! res
bracketedVal :: Parser S.ByteString
bracketedVal = do
lbrack
res <- takeTill (== ']')
rbrack
- return res
+ return $! res
val :: Parser S.ByteString
val = takeTill (== ' ')
@@ -67,14 +67,14 @@
space
bytes <- val
(path,ua) <- option ("","") combined
- return $ LogLine ip identity user date req status bytes path ua
+ return $! LogLine ip identity user date req status bytes path ua
combined :: Parser (S.ByteString,S.ByteString)
combined = do
space
- path <- quotedVal
+ !path <- quotedVal
space
- ua <- quotedVal
+ !ua <- quotedVal
return (path,ua)
countBytes :: [L.ByteString] -> Int
@@ -84,11 +84,11 @@
Just x -> (acc +) . maybe 0 fst . S.readInt . getBytes $ x
Nothing -> acc
-countIPs :: [L.ByteString] -> M.Map S.ByteString Int
+countIPs :: [L.ByteString] -> M.HashMap S.ByteString Int
countIPs = foldl' count M.empty
where
count acc l = case AL.maybeResult $ AL.parse line l of
- Just x -> M.insertWith' (+) (getIP x) 1 acc
+ Just x -> M.insertWith (+) (getIP x) 1 acc
Nothing -> acc
---------------------------------------------------------------------------------
I made the fields of LogLine
strict to avoid them containing thunks referring to expressions related to parsing. It's good practice to make fields strict, unless you really need them to be lazy.
I made sure that the parse result is created as soon as possible (that's the $!
part of the change), also to avoid delaying the parsing until you actually inspect the individual fields of LogLine
.
Finally I switched to a better data structure, HashMap
from the unordered-containers package. Note that all functions in Data.HashMap.Strict
are value strict, which means we can use the plain insertWith
variant.
Note that taking a sub-string of a ByteString
forces the original string to be retained in memory, due to sharing the underlying storage (this is the same as for Java's String
). If you want to make sure that no extra memory is retained, use the copy
function from the bytestring
package. You can try to call copy
on the result of (getIP x)
and see if that makes any difference. The trade-off here is using some extra computation to copy the string in return for lower space usage.
Note that using -A<high number>
tend to improve performance of short running programs (i.e. benchmarks) but not necessarily on real programs. Same goes for -H
. At least a higher -H
value (e.g. 1G) doesn't hurt the performance of your program.
Upvotes: 3
Reputation: 1796
The most obvious point is that your first script can throw away data as soon as it's seen it, whereas the second one must hold on to everything it's seen. Hence, you'd expect the second script to take at least O(N) memory whereas the first can run in constant space.
Have you tried running with heap profiling turned on? I could make some stabs at where the excess allocations are likely to be happening in your code, but there's no substitute for hard data.
I'd be eying the Data.Map.insertWith' calls with suspicion myself, since each one renders a chunk of the extant Map surplus to requirements and requires copying and rebalancing but that's pure guesswork on my part. If the insertWith' calls are to blame, then since you don't need interstitial Map entries, it might be faster to build the entire map in one pass (without any increments to count IPs) and then do a second pass to do the counts. That way you won't waste time rebalancing the Map. You could also take advantage of the fact that your key datatype fits into an Int (well, it does if it's an IPv4 address at least) and use a Data.IntMap instead, which has much lower memory overhead.
Upvotes: 0