Reputation: 31
I am trying to load a csv into R that is c. 180k rows and c. 9k columns. I only need a subset of c. 100 columns, and I know the names of the columns that I need.
The answers to this question cover many of the different options for selecting a subset of columns when importing a csv.
Using those answers, and my own knowledge, I have tried read.csv and read.table from utils, fread from data.table, and read_csv from readr. All of these functions seem to import the whole csv and then subset the columns, which is very inefficient when I only need such a small subset. I have also tried read.csv.sql from sqldf, which seemed promising since selecting specific columns is a very common SQL task, but I could not import a subset of columns because I got the error: Error: too many columns on file.
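To make my attempts concrete, here is a minimal sketch of the kind of calls I tried (file and column names are placeholders, not my real data):

keep <- c("col_a", "col_b", "col_c")  # placeholders for the c. 100 columns I need
df1 <- read.csv("big.csv")[, keep]                  # utils: subsets after the full read
df2 <- data.table::fread("big.csv", select = keep)  # data.table
df3 <- readr::read_csv("big.csv")[, keep]           # readr
df4 <- sqldf::read.csv.sql("big.csv",
         sql = "select col_a, col_b, col_c from file")  # Error: too many columns on file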
Two contributions from the community that would really help me out: (1) a way to import only the named columns without reading the whole csv first, and (2) an explanation of, or fix for, the read.csv.sql error above.
Many thanks!
P.S. I haven't asked many questions on SO before so I'd appreciate feedback if I need to ask in a different way.
Upvotes: 1
Views: 677
Reputation: 270348
There are many command line utilities (e.g. sed, awk, cut, csvfix, miller, csvkit, csvtk) that can do this. Below we use xsv (releases). This removes the unwanted columns before the file ever reaches R. Use the full path to xsv if it is not already on the PATH. xsv accepts column names, as shown below, or field numbers.
# write out test data
write.csv(iris, "iris-test.csv", quote = FALSE, row.names = FALSE)
cmd <- "xsv select Sepal.Length,Petal.Length-Species iris-test.csv"
DF <- read.csv(pipe(cmd))
head(DF)
## Sepal.Length Petal.Length Petal.Width Species
## 1 5.1 1.4 0.2 setosa
## 2 4.9 1.4 0.2 setosa
## 3 4.7 1.3 0.2 setosa
## 4 4.6 1.5 0.2 setosa
## 5 5.0 1.4 0.2 setosa
## 6 5.4 1.7 0.4 setosa
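Since xsv select also accepts field numbers, the same selection could presumably be written by 1-based position (a sketch using the same test file):

cmd_idx <- "xsv select 1,3-5 iris-test.csv"  # same columns, selected by field number
DF_idx <- read.csv(pipe(cmd_idx))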
Or with UNIX cut (also available on Windows via Rtools, in \Rtools40\usr\bin, for R 4.0+) the following works. Use the full path to cut if it is not already on your PATH.
cmd2 <- "cut -d, -f 1,3-5 iris-test.csv"
DF <- read.csv(pipe(cmd2))
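If you are already using data.table, the same pipeline can alternatively be fed to fread via its cmd argument (a minimal sketch):

library(data.table)
DF <- fread(cmd = cmd2)  # fread parses the stdout of the cut command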
Upvotes: 3
Reputation: 578
Here's what worked for me. I keep csv files compressed in zstd format for better performance compared to gzip. If you use gzipped csv files, just replace zstd with gunzip and adjust the command line options.
You need to keep the zstd binary, downloaded from https://github.com/facebook/zstd/releases, and the csvtk binary, downloaded from https://bioinf.shenwei.me/csvtk/, on your system PATH.
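A quick sanity check from within R that both binaries are visible (Sys.which returns an empty string for anything not on the PATH):

Sys.which(c("zstd", "csvtk"))  # empty string = binary not found on the PATH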
Suppose you need to load only three columns, viz. YR_TA, MJH_CD, TV_TC_NO, from a wide csv file (which also contains many other columns) into R, while also specifying data types for the required columns.
The following code loads only the specified columns from the csv file into R; R won't even know about the other columns present in the file.
library(data.table)

fyl <- "... path to your compressed csv file"

# define the column data type specification for R
cols <- c(YR_TA = "factor", MJH_CD = "character", TV_TC_NO = "character")

# decompress to stdout, keep only the three columns with csvtk, then parse with fread
dt <- fread(cmd = paste("zstd -dcq", fyl, "| csvtk cut -f YR_TA,MJH_CD,TV_TC_NO"),
            select = cols)
Note that if the csv file is really huge relative to your computer's RAM, selecting columns from within the fread call using select will cause a low-memory error, because fread still needs to map the entire csv file into memory before the selection. So it's better that fread sees only the required columns; streaming just those columns to fread with the external tool csvtk accomplishes that.
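For the gzip case mentioned above, the equivalent call would look roughly like this (a sketch; gunzip -c streams the decompressed file to stdout):

dt_gz <- fread(cmd = paste("gunzip -c", fyl, "| csvtk cut -f YR_TA,MJH_CD,TV_TC_NO"),
               select = cols)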
Upvotes: 0