Keynes2.0
Keynes2.0

Reputation: 31

What is the most efficient read csv function / package in R at importing a subset of csv columns

I am trying to load a csv into R that is c. 180k rows and c.9k columns. I only need a subset of c.100 columns, and know the names of the columns that I need.

The answers for this question cover a lot of the different options for selecting a subset of columns when importing a csv.

Using those answers, and my own knowledge, I have tried to use read.csv and read.table from utils, fread from data.table, and read_csv from readr. All of these functions seem to import the whole csv and then subset the columns - which is very inefficient where I only need such a small subset of columns. I have also tried to use read.csv.sql from sqldf, which seemed promising as selecting specific columns is a very common SQL task, but I could not import a subset of columns because I got the error Error: too many columns on file.

Two contributions from the community that would really help me out:

  1. Can anyone point to a different import function that would be a lot more efficient where I only need a subset of columns?
  2. Can anyone give more context about how these functions work and correct my misunderstanding or reassure me that the solution I'm looking for doesn't exist?

Many thanks!

P.S. I haven't asked many questions on SO before so I'd appreciate feedback if I need to ask in a different way.

Upvotes: 1

Views: 677

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270348

There are many command line utilities (e.g. sed, awk, cut, csvfix, miller, csvkit, csvtk) that can do this. Below we use xsv (releases). This removes the unwanted columns before the file ever reaches R. Use the full path to xsv if iot is not already on the PATH. xsv accepts column names as shown below or field numbers.

# write out test data
write.csv(iris, "iris-test.csv", quote = FALSE, row.names = FALSE)

cmd <- "xsv select Sepal.Length,Petal.Length-Species iris-test.csv"
DF <- read.csv(pipe(cmd))

head(DF)
##   Sepal.Length Petal.Length Petal.Width Species
## 1          5.1          1.4         0.2  setosa
## 2          4.9          1.4         0.2  setosa
## 3          4.7          1.3         0.2  setosa
## 4          4.6          1.5         0.2  setosa
## 5          5.0          1.4         0.2  setosa
## 6          5.4          1.7         0.4  setosa

Or with UNIX cut (also available in Windows Rtools in \Rtools40\usr\bin in R 4.0+) the following works. Use the full path to cut if cut is not already on your PATH.

cmd2 <- "cut -d, -f 1,3-5 iris-test.csv"
DF <- read.csv(pipe(cmd2))

Upvotes: 3

San
San

Reputation: 578

Here's what worked for me. I keep csv files compressed in zstd format for better performance compared to gzip. If you use gzipped csv files just replace zstd with gunzip and adjust the command line options.

You need to keep the zstd binary downloaded from https://github.com/facebook/zstd/releases and csvtk binary downloaded from https://bioinf.shenwei.me/csvtk/ in your system path.

Suppose you need to load only three columns, viz. YR_TA, MJH_CD, TV_TC_NO from a wide csv file (which also contains many other columns) into R while also specifying data types for the required columns.

The following code loads only specified columns from csv file into R and R won't even know about other columns present in the csv file.

library(data.table)
fyl <- "... path to your compressed csv file"

# define column data type specification for R
cols <- c(YR_TA="factor", MJH_CD="character", TV_TC_NO="character")

dt <- fread(cmd = paste("zstd -dcq", fyl, "| csvtk cut -f YR_TA,MJH_CD,TV_TC_NO"), select = cols)

Note that if the csv file is really huge for your computer RAM, selecting columns from within fread statement using select will cause low memory error because fread still needs to map the entire csv file in memory before selection. So it's better that fread sees only the required columns. Streaming only the required columns to fread using the external tool csvtk helps with that.

Upvotes: 0

Related Questions