MichaelChirico
MichaelChirico

Reputation: 34703

Removing all columns with a name on the fly

I'm using read_excel for speed and simplicity to import an Excel file.

Unfortunately, there's as yet no capability of excluding selected columns which won't be needed from the data set; to save effort, I am naming such columns "x" with the col_names argument, which is easier than trying to keep track of x1, x2, and so on.

I'd then like to exclude such columns on the fly if possible to avoid an extra step of copying, so in pseudocode:

read_excel("data.xlsx", col_names = c("x", "keep", "x"))[ , !"x"]

We can use the sample data set included with the readxl package for illustration:

library(readxl)
DF <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                 col_names = c("x", "x", "length", "width", "x"), skip = 1L)

The approaches I've seen that work don't exactly work on the fly, e.g., having stored DF, we can now do:

DF <- DF[ , -grep("^x$", names(DF))]

This works but requires making a copy of DF by storing it, then overwriting it; I'd prefer to remove the columns in the same command as read_excel to allocate DF properly ab initio.

Other similar approaches require declaring temporary variables, which I prefer to avoid if possible, e.g.,

col_names <- c("x", "x", "length", "width", "x")
DF <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                 col_names = col_names, skip = 1L)[ , -grep("^x$", col_names)]

Is there a way to axe these columns without creating unnecessary temporary variables?

(I could convert to data.table, but am wondering if there's a way to do so without data.table)

Upvotes: 0

Views: 265

Answers (2)

alistaire
alistaire

Reputation: 43334

There actually is a way to do this in readxl::read_excel, though it's a little hidden, and I have no idea if the columns are read into memory [temporarily] regardless. The trick is to specify column types, putting in "blank" for those you don't want:

readxl::read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                   col_types = c('blank', 'blank', 'numeric', 'numeric', 'text'))
## # A tibble: 150 x 3
##    Petal.Length Petal.Width Species
##           <dbl>       <dbl>   <chr>
## 1           1.4         0.2  setosa
## 2           1.4         0.2  setosa
## 3           1.3         0.2  setosa
## 4           1.5         0.2  setosa
## 5           1.4         0.2  setosa
## 6           1.7         0.4  setosa
## 7           1.4         0.3  setosa
## 8           1.5         0.2  setosa
## 9           1.4         0.2  setosa
## 10          1.5         0.1  setosa
## # ... with 140 more rows

The caveat is that you need to know all the data types of the columns you want, though I suppose you could always just start with text and clean up later with type.convert or whatnot.

Upvotes: 2

Axeman
Axeman

Reputation: 35242

I don't see an easy way to avoid copying. But a one liner is achievable using piping, needing no temporary variables. E.g.:

library(magrittr)
read_excel(system.file("extdata/datasets.xlsx", package = "readxl"), 
           col_names = c("x", "x", "length", "width", "x"), skip = 1L) %>% 
  extract(, -grep("^x$", names(.))) -> 
  DF

Upvotes: 2

Related Questions