asg0451
asg0451

Reputation: 503

Fastest way to parse a csv file into a vector of vectors and then spit it back out?

I'm trying to find the fastest method to reorder columns of a csv file (using the simple csv subset where there are no commas in cells). The reordering I'm doing through Vector.backpermute and that is fine; the bottleneck as indicated by RTS -p is the constructing of the vector of vectors that I do this operation on. The code below is the fastest version I could come up with. Anyone have any ideas?

{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Control.Applicative
import           Control.Monad
import qualified Data.ByteString            as B
import qualified Data.ByteString.Builder    as BB
import qualified Data.ByteString.Lazy       as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import           Data.Char
import           Data.Foldable
import           Data.Monoid
import qualified Data.Vector                as V
import           Data.Word
import           Debug.Trace
import           System.Environment
import           System.IO

data Args = Args { cols :: V.Vector Int, filePath :: FilePath } deriving (Show)

--
w8 = fromIntegral . ord
mconcat' :: (Foldable t, Monoid a) => t a -> a
mconcat' = foldl' (<>) mempty

parseArgs :: [String] -> Args
parseArgs [colStr, filePath] = Args ((\n -> n-1) . read <$> V.fromList (split ',' colStr)) filePath
  where split :: Char -> String -> [String]
        split d str = gosplit d str []
        gosplit d "" acc = reverse acc
        gosplit d str acc = gosplit d (drop 1 $ dropWhile (/= d) str) $ takeWhile (/= d) str : acc

reorder :: Args -> BL.ByteString -> BB.Builder
reorder (Args cols _ ) bstr =
  -- transform to vec matrix
  let rows = V.filter (not . BL.null) $ V.fromList $ BL.split (w8 '\n') bstr
      m = (V.fromList . BL.split (w8 ',')) <$> rows -- n^2
  -- reorder
      m' = (flip V.backpermute) cols <$> m
  -- build back to bytestring
      numRows = length m'
      numCols = length cols
      builderM = mconcat' . V.imap (\i v -> BB.lazyByteString v <> (if i < numCols - 1 then "," else "")) <$> m'
      builderM' = mconcat' . V.imap (\i v -> v <> (if i < numRows - 1 then "\n" else "")) $ builderM
  in builderM'

main :: IO ()
main = do
  args <- parseArgs <$> getArgs

  withFile (filePath args) ReadMode $ \h -> do
    csvData <- BL.hGetContents h
    BB.hPutBuilder stdout $ reorder args csvData

The program is invoked like: $ reorder 2,1 x.csv which says give me the second and then the first column for all the rows of that csv, so you can ignore the argument parsing bit.

Upvotes: 1

Views: 128

Answers (1)

Jack Henahan
Jack Henahan

Reputation: 1386

I feel like you're working too hard. Manually building up and transforming all this data is error-prone and difficult to reason about (at least for me). cassava is made for this kind of task.

I can't totally unwrap the structure of your data from the code you've presented, so I'm going to use a trivial example to demonstrate how to achieve the goal "reorder such and such column(s)".

Suppose you have a CSV describing a list of people and their ages.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Text
import Data.Csv
import Data.Vector

data Person = Person { name :: !Text , age :: !Int } deriving (Generic, Show)

-- We want to read and write TSVs

decodeOpt :: DecodeOptions
decodeOpt = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

encodeOpt :: EncodeOptions
encodeOpt = defaultEncodeOptions { encDelimiter = fromIntegral (ord '\t') }

-- NB: Ideally, your encode and decode should be inverses, but these aren't
dec :: FromRecord a => HasHeader -> ByteString -> Either String (Vector a)
dec = decodeWith decodeOpt

enc :: ToRecord a => [a] -> ByteString
enc = encodeWith encodeOpt

Now we're gonna do some magic:

instance FromRecord Person

instance ToRecord Person where
  toRecord (Person name age) = record [ toField age, toField name ]

And now we can take

dec NoHeader "Roy\t30\r\nJim\t32" :: Either String (Vector Person)

and get

Right [Person {name = "Roy", age = 30}
      ,Person {name = "Jim", age = 32}]

And then reserialize them

enc [Person "Roy" 30, Person "Jim" 32]

with our result

"30\tRoy\r\n32\tJim\r\n"

So, this is all well and good assuming you're interested in just index-based column manipulation. If your CSV has column-names, you can be even more direct about things.

instance ToNamedRecord Person
instance DefaultOrdered Person
instance FromNamedRecord Person

-- NB: Ideally, your encode and decode should be inverses, but these aren't
decName ::FromNamedRecord a => ByteString -> Either String (Header, Vector a)
decName = decodeByNameWith decodeOpt

encName :: ToNamedRecord a => [a] -> ByteString
encName = encodeByNameWith encodeOpt (header ["age", "name"])

Now we can do this

encName [Person "Roy" 30, Person "Jim" 32]

and get

"age\tname\r\n30\tRoy\r\n32\tJim\r\n"

or

decName "name\tage\r\nRoy\t30\r\nJim\t32" :: Either String (Header, Vector Person)

to get

Right ( ["name","age"]
      , [Person { name = "Roy", age = 30 }
      , Person { name = "Jim", age = 32 }] )

Finally, if you really don't want any structure, cassava can deal with that, too.

dec NoHeader "Roy\t30\r\nJim\t32\r\n" :: Either String (Vector (Vector ByteString))

Which gives us

Right [["Roy","30"],["Jim","32"]]

And

enc [["Roy","30"],["Jim","32"]]

gives us

"Roy\t30\r\nJim\t32\r\n"

In this case, they're just regular lists, so you can do whatever transformations you like on the sublists to rearrange columns as you please.

Upvotes: 1

Related Questions