gsmafra
gsmafra

Reputation: 2494

Read CSV in R and filter columns by name

Let's say I have a CSV with dozens or hundreds of columns and I want to pull in just about 2 or 3 columns. I know about the colClasses solution as described here but the code gets very unreadable.

I want something like usecols from pandas' read_csv.

Loading everything and just selecting afterwards is not a solution (the file is super big, it doesn't fit in memory).

Upvotes: 4

Views: 2331

Answers (2)

Rui Barradas
Rui Barradas

Reputation: 76402

One way is to use package sqldf. If you know SQL, it is possible to read in large files filtering only the parts you want.

I will use built-in dataset iris to make the example reproducible, saving it to disk first.

write.csv(iris, "iris.csv", row.names = FALSE)

Now the problem.
Argument row.names is like in the write.csv instruction.
Note the backticks around Sepal.Length. This is due to the dot character in the column name.

library(sqldf)

sql <- "select `Sepal.Length`, Species from file"
sub_iris <- read.csv.sql("iris.csv", sql = sql, row.names = FALSE)

head(sub_iris)
#  Sepal.Length  Species
#1          5.1 "setosa"
#2          4.9 "setosa"
#3          4.7 "setosa"
#4          4.6 "setosa"
#5          5.0 "setosa"
#6          5.4 "setosa"

And final clean up.

unlink("iris.csv")

Upvotes: 5

LocoGris
LocoGris

Reputation: 4480

I will use package data.table and then with fread() specify columns to keep/drop by arguments selector drop. From ?fread

select Vector of column names or numbers to keep, drop the rest.

drop Vector of column names or numbers to drop, keep the rest.

Best!

Upvotes: 4

Related Questions