Reputation: 503
I'm trying to find the fastest method to reorder columns of a csv file (using the simple csv subset where there are no commas in cells). The reordering I'm doing through Vector.backpermute and that is fine; the bottleneck as indicated by RTS -p is the constructing of the vector of vectors that I do this operation on. The code below is the fastest version I could come up with. Anyone have any ideas?
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Applicative
import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Char
import Data.Foldable
import Data.Monoid
import qualified Data.Vector as V
import Data.Word
import Debug.Trace
import System.Environment
import System.IO
data Args = Args { cols :: V.Vector Int, filePath :: FilePath } deriving (Show)
--
w8 = fromIntegral . ord
mconcat' :: (Foldable t, Monoid a) => t a -> a
mconcat' = foldl' (<>) mempty
parseArgs :: [String] -> Args
parseArgs [colStr, filePath] = Args ((\n -> n-1) . read <$> V.fromList (split ',' colStr)) filePath
where split :: Char -> String -> [String]
split d str = gosplit d str []
gosplit d "" acc = reverse acc
gosplit d str acc = gosplit d (drop 1 $ dropWhile (/= d) str) $ takeWhile (/= d) str : acc
reorder :: Args -> BL.ByteString -> BB.Builder
reorder (Args cols _ ) bstr =
-- transform to vec matrix
let rows = V.filter (not . BL.null) $ V.fromList $ BL.split (w8 '\n') bstr
m = (V.fromList . BL.split (w8 ',')) <$> rows -- n^2
-- reorder
m' = (flip V.backpermute) cols <$> m
-- build back to bytestring
numRows = length m'
numCols = length cols
builderM = mconcat' . V.imap (\i v -> BB.lazyByteString v <> (if i < numCols - 1 then "," else "")) <$> m'
builderM' = mconcat' . V.imap (\i v -> v <> (if i < numRows - 1 then "\n" else "")) $ builderM
in builderM'
main :: IO ()
main = do
args <- parseArgs <$> getArgs
withFile (filePath args) ReadMode $ \h -> do
csvData <- BL.hGetContents h
BB.hPutBuilder stdout $ reorder args csvData
The program is invoked like: $ reorder 2,1 x.csv
which says give me the second and then the first column for all the rows of that csv, so you can ignore the argument parsing bit.
Upvotes: 1
Views: 128
Reputation: 1386
I feel like you're working too hard. Manually building up and transforming all this data is error-prone and difficult to reason about (at least for me). cassava
is made for this kind of task.
I can't totally unwrap the structure of your data from the code you've presented, so I'm going to use a trivial example to demonstrate how to achieve the goal "reorder such and such column(s)".
Suppose you have a CSV describing a list of people and their ages.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
import Data.Text
import Data.Csv
import Data.Vector
data Person = Person { name :: !Text , age :: !Int } deriving (Generic, Show)
-- We want to read and write TSVs
decodeOpt :: DecodeOptions
decodeOpt = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }
encodeOpt :: EncodeOptions
encodeOpt = defaultEncodeOptions { encDelimiter = fromIntegral (ord '\t') }
-- NB: Ideally, your encode and decode should be inverses, but these aren't
dec :: FromRecord a => HasHeader -> ByteString -> Either String (Vector a)
dec = decodeWith decodeOpt
enc :: ToRecord a => [a] -> ByteString
enc = encodeWith encodeOpt
Now we're gonna do some magic:
instance FromRecord Person
instance ToRecord Person where
toRecord (Person name age) = record [ toField age, toField name ]
And now we can take
dec NoHeader "Roy\t30\r\nJim\t32" :: Either String (Vector Person)
and get
Right [Person {name = "Roy", age = 30}
,Person {name = "Jim", age = 32}]
And then reserialize them
enc [Person "Roy" 30, Person "Jim" 32]
with our result
"30\tRoy\r\n32\tJim\r\n"
So, this is all well and good assuming you're interested in just index-based column manipulation. If your CSV has column-names, you can be even more direct about things.
instance ToNamedRecord Person
instance DefaultOrdered Person
instance FromNamedRecord Person
-- NB: Ideally, your encode and decode should be inverses, but these aren't
decName ::FromNamedRecord a => ByteString -> Either String (Header, Vector a)
decName = decodeByNameWith decodeOpt
encName :: ToNamedRecord a => [a] -> ByteString
encName = encodeByNameWith encodeOpt (header ["age", "name"])
Now we can do this
encName [Person "Roy" 30, Person "Jim" 32]
and get
"age\tname\r\n30\tRoy\r\n32\tJim\r\n"
or
decName "name\tage\r\nRoy\t30\r\nJim\t32" :: Either String (Header, Vector Person)
to get
Right ( ["name","age"]
, [Person { name = "Roy", age = 30 }
, Person { name = "Jim", age = 32 }] )
Finally, if you really don't want any structure, cassava
can deal with that, too.
dec NoHeader "Roy\t30\r\nJim\t32\r\n" :: Either String (Vector (Vector ByteString))
Which gives us
Right [["Roy","30"],["Jim","32"]]
And
enc [["Roy","30"],["Jim","32"]]
gives us
"Roy\t30\r\nJim\t32\r\n"
In this case, they're just regular lists, so you can do whatever transformations you like on the sublists to rearrange columns as you please.
Upvotes: 1