Fastest way to parse a csv file into a vector of vectors and then spit it back out?

Question

I'm trying to find the fastest way to reorder the columns of a CSV file (using the simple CSV subset in which cells contain no commas). I do the reordering with Vector.backpermute, and that part is fine; according to profiling with +RTS -p, the bottleneck is constructing the vector of vectors that the operation runs on. The code below is the fastest version I could come up with. Does anyone have any ideas?

{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Control.Applicative
import           Control.Monad
import qualified Data.ByteString            as B
import qualified Data.ByteString.Builder    as BB
import qualified Data.ByteString.Lazy       as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import           Data.Char
import           Data.Foldable
import           Data.Monoid
import qualified Data.Vector                as V
import           Data.Word
import           Debug.Trace
import           System.Environment
import           System.IO

data Args = Args { cols :: V.Vector Int, filePath :: FilePath } deriving (Show)

-- Helpers
w8 :: Char -> Word8
w8 = fromIntegral . ord
mconcat' :: (Foldable t, Monoid a) => t a -> a
mconcat' = foldl' (<>) mempty

parseArgs :: [String] -> Args
parseArgs [colStr, filePath] = Args ((\n -> n-1) . read <$> V.fromList (split ',' colStr)) filePath
  where split :: Char -> String -> [String]
        split d str = gosplit d str []
        gosplit d "" acc = reverse acc
        gosplit d str acc = gosplit d (drop 1 $ dropWhile (/= d) str) $ takeWhile (/= d) str : acc

reorder :: Args -> BL.ByteString -> BB.Builder
reorder (Args cols _ ) bstr =
  -- transform to vec matrix
  let rows = V.filter (not . BL.null) $ V.fromList $ BL.split (w8 '\n') bstr
      m = (V.fromList . BL.split (w8 ',')) <$> rows -- n^2
  -- reorder
      m' = (flip V.backpermute) cols <$> m
  -- build back to bytestring
      numRows = length m'
      numCols = length cols
      builderM = mconcat' . V.imap (\i v -> BB.lazyByteString v <> (if i < numCols - 1 then "," else "")) <$> m'
      builderM' = mconcat' . V.imap (\i v -> v <> (if i < numRows - 1 then "\n" else "")) $ builderM
  in builderM'

main :: IO ()
main = do
  args <- parseArgs <$> getArgs

  withFile (filePath args) ReadMode $ \h -> do
    csvData <- BL.hGetContents h
    BB.hPutBuilder stdout $ reorder args csvData

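The program is invoked like this: $ reorder 2,1 x.csv. That means: output the second column and then the first column for every row of that CSV. The column numbers are 1-based on the command line and converted to 0-based indices internally, so you can ignore the argument-parsing part.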
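
To make the intended behaviour concrete, here is a minimal sketch of the reordering applied to a single row. It uses strict Data.ByteString.Char8 rather than the lazy ByteString and Builder from the program above, purely to keep the example short; the names toIndices and reorderRow are hypothetical helpers for illustration and do not appear in the original code.

```haskell
import qualified Data.ByteString.Char8 as B8
import qualified Data.Vector           as V

-- Hypothetical helper: turn 1-based column numbers (as passed on the
-- command line) into 0-based indices.
toIndices :: [Int] -> V.Vector Int
toIndices = V.fromList . map (subtract 1)

-- Hypothetical helper: reorder the cells of one comma-separated row.
-- Assumes the simple CSV subset (no commas inside cells) and that every
-- index is in range; V.backpermute raises an error otherwise.
reorderRow :: V.Vector Int -> B8.ByteString -> B8.ByteString
reorderRow idxs row =
  let cells = V.fromList (B8.split ',' row)
  in  B8.intercalate (B8.pack ",") (V.toList (V.backpermute cells idxs))
```

For example, reorderRow (toIndices [2,1]) (B8.pack "a,b,c") yields "b,a", which is what reorder 2,1 should produce for a row a,b,c.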
