Read numbers from stdin into a Data.Vector.Unboxed.Vector Int64

Question

Given is a text file (for piping) with many numbers divided by a space, like so:

234 456 345 ...

What is the best way to read them all into a Data.Vector.Unboxed.Vector Int64? My current code looks like this:

import Control.Applicative
import Control.Arrow
import Data.Int
import Data.Maybe
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as V

main :: IO ()
main = do
    v <- readInts <$> B.getContents
    print $ V.maximum v

-- splitted for profiling
readInts :: B.ByteString -> V.Vector Int64
readInts = a >>> b >>> c >>> d

a = B.split ' '
b = mapMaybe (B.readInt >>> liftA fst)
c = map fromIntegral
d = V.fromList

Here is the profiler output

    Thu Sep 18 16:08 2014 Time and Allocation Profiling Report  (Final)

       FastReadInts +RTS -p -K800M -RTS

    total time  =        0.51 secs   (505 ticks @ 1000 us, 1 processor)
    total alloc = 1,295,988,256 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

d           Main     74.3    5.2
b           Main      9.9   35.6
a           Main      6.3   40.0
main        Main      4.8    0.0
c           Main      3.2   19.3


                                                        individual     inherited
COST CENTRE MODULE                    no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                       60           0    0.4    0.0   100.0  100.0
 main       Main                      121           0    4.8    0.0    98.2  100.0
  readInts  Main                      123           0    0.0    0.0    93.5  100.0
   a        Main                      131           0    6.1   40.0     6.1   40.0
   b        Main                      129           0    9.9   35.6     9.9   35.6
   c        Main                      127           0    3.2   19.3     3.2   19.3
   d        Main                      125           0   74.3    5.2    74.3    5.2
 CAF        Main                      119           0    0.0    0.0     0.2    0.0
  a         Main                      130           1    0.2    0.0     0.2    0.0
  b         Main                      128           1    0.0    0.0     0.0    0.0
  c         Main                      126           1    0.0    0.0     0.0    0.0
  d         Main                      124           1    0.0    0.0     0.0    0.0
  readInts  Main                      122           1    0.0    0.0     0.0    0.0
  main      Main                      120           1    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Handle.FD          103           0    0.6    0.0     0.6    0.0
 CAF        GHC.IO.Encoding            96           0    0.2    0.0     0.2    0.0
 CAF        GHC.IO.Handle.Internals    93           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal            83           0    0.2    0.0     0.2    0.0
 CAF        GHC.IO.Encoding.Iconv      81           0    0.2    0.0     0.2    0.0

The programm is compiled and run this way:

ghc -O2 -prof -auto-all -rtsopts FastReadInts.hs
./FastReadInts +RTS -p -K800M < many_numbers.txt

many_numbers.txt is about 14MB large.

How can this bottleneck, i.e. V.fromList, be removed?

Thomas M. DuBuisson · Accepted Answer

It is hard to answer questions like this without some expected level of performance or point of comparison. By simply omitting the profiling your code runs in 100ms over an ASCii file of 21MB of random 64-bit numbers, this seems reasonable to me.

$ time ./so < randoms.txt 
9223350746261547498

real    0m0.109s
user    0m0.094s
sys     0m0.013s

And the generation of the test data:

import System.Random

main = do
    g <- newStdGen
    let rs = take (2^20) $ randomRs (0,2^64) g :: [Integer]
    writeFile "randoms.txt" $ unwords (map show rs)

EDIT:

As requested:

import Data.Vector.Unboxed.Mutable as M
...
listToVector :: [Int64] -> V.Vector Int64
listToVector ls = unsafePerformIO $ do
        m <- M.unsafeNew (2^20)
        zipWithM_ (M.unsafeWrite m) [0..(2^20)-1] ls
        V.unsafeFreeze m

Read numbers from stdin into a Data.Vector.Unboxed.Vector Int64

Answers (2)

Related Questions