carnust

Reputation: 621

identify and remove single-valued columns from a table in R

I have a reasonably large dataset (~250k rows and 400 columns, roughly 0.5 GB) in which a number of columns are single-valued (i.e. they contain only one value). To remove these columns from the dataset I use data[, apply(data, 2, function(x) length(unique(x)) != 1)], which works fine. I was wondering if there might be a more efficient way of doing this? On my PC this takes:

> system.time(apply(data, 2, function(x) length(unique(x))))
#   user  system elapsed 
#  34.37    0.71   35.15 

That isn't so bad for one dataset, but I'd like to repeat this multiple times on different datasets.
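For reference, a synthetic dataset of roughly this shape can be built as below; the column types and the fraction of constant columns are assumptions, not details from the real data:

set.seed(1)
n <- 250000  # assumed row count from the question
cols <- replicate(400, {
  if (runif(1) < 0.1) rep("x", n)            # ~10% deliberately single-valued columns
  else sample(letters, n, replace = TRUE)
}, simplify = FALSE)
names(cols) <- paste0("V", seq_along(cols))
data <- as.data.frame(cols, stringsAsFactors = FALSE)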

Upvotes: 0

Views: 89

Answers (2)

akrun

Reputation: 887851

You may also try:

set.seed(40)
df <- as.data.frame(matrix(sample(letters[1:3], 3 * 10, replace = TRUE), ncol = 10))
Filter(function(x) length(unique(x)) > 1, df)

Or

df[, colSums(df[-1, ] == df[-nrow(df), ]) != (nrow(df) - 1)] # still better than `apply`
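The idea behind that one-liner: df[-1, ] drops the first row and df[-nrow(df), ] drops the last, so the comparison lines up each row with its predecessor; a column is single-valued exactly when all nrow(df) - 1 consecutive pairs match. A minimal sketch on a single vector (note the trick assumes no NA values, since NA == NA yields NA):

x <- c("a", "a", "a", "b")
x[-1] == x[-length(x)]                        # each element vs. the previous one
# [1]  TRUE  TRUE FALSE
sum(x[-1] == x[-length(x)]) == length(x) - 1  # FALSE: x is not single-valued
# [1] FALSE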

Including these in the speed comparison as well (using @beginneR's sample data):

microbenchmark(
  new = {Filter(function(x) length(unique(x)) > 1, df)},
  new1 = {df[, colSums(df[-1, ] == df[-nrow(df), ]) != (nrow(df) - 1)]},
  apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
  lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
  unit = "relative",
  times = 100)
# Unit: relative
#   expr        min         lq    median         uq      max neval
#    new  1.0000000  1.0000000  1.000000  1.0000000 1.000000   100
#   new1  4.3741503  4.5144133  4.063634  3.9591345 1.713178   100
#  apply 23.9635826 24.0895813 21.361140 20.7650416 5.757233   100
# lapply  0.9991514  0.9979483  1.002005  0.9958308 1.002603   100

Upvotes: 1

talat

Reputation: 70336

You can use lapply instead:

data[, unlist(lapply(data, function(x) length(unique(x)) > 1L))]

Note that I added unlist to convert the resulting list into a logical vector of TRUE/FALSE values, which is then used for the subsetting.
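An equivalent form (my own variation, not part of the original answer) uses vapply, which enforces a single logical result per column and so removes the need for unlist:

# vapply guarantees each element is one logical, so the result is already a vector
data[, vapply(data, function(x) length(unique(x)) > 1L, logical(1))]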

Edit: here's a little benchmark:

library(microbenchmark)

a <- runif(1e4)
b <- 99
c <- sample(LETTERS, 1e4, TRUE)
df <- data.frame(a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c)

microbenchmark(
  apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
  lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
  unit = "relative",
  times = 100)

# Unit: relative
#   expr      min       lq   median       uq      max neval
#  apply 41.29383 40.06719 39.72256 39.16569 28.54078   100
# lapply  1.00000  1.00000  1.00000  1.00000  1.00000   100

Note that apply first converts the data.frame to a matrix and then performs the operation, which is less efficient (it also coerces all columns to a single common type). So in most cases where you're working with data.frames, you can (and should) avoid apply and use e.g. lapply instead.
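A quick illustration of that coercion, on a hypothetical two-column data frame (not from the answer):

df2 <- data.frame(num = 1:3, chr = c("a", "b", "c"), stringsAsFactors = FALSE)

apply(df2, 2, class)   # apply goes through as.matrix(), so both columns look like character
#         num         chr
# "character" "character"

sapply(df2, class)     # lapply/sapply work column-wise and keep the original types
#         num         chr
#   "integer" "character"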

Upvotes: 1
