carnust

Reputation: 621

identify and remove single-valued columns from a table in R

I have a reasonably large dataset (~250k rows and 400 columns, roughly 0.5 GB) in which a number of columns are single-valued (i.e. they contain only one value). To remove these columns from the dataset I use data[, apply(data, 2, function(x) length(unique(x)) != 1)], which works fine. I was wondering if there might be a more efficient way of doing this? On my PC this takes:

> system.time(apply(data, 2, function(x) length(unique(x))))
#   user  system elapsed 
#  34.37    0.71   35.15 

That isn't so bad for one dataset, but I'd like to repeat this multiple times on different datasets.
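For reference, a synthetic dataset of roughly this shape can be built as below; the column types and the fraction of constant columns are assumptions, not details from the real data:

set.seed(1)
n <- 250000  # assumed row count from the question
cols <- replicate(400, {
  if (runif(1) < 0.1) rep("x", n)            # ~10% deliberately single-valued columns
  else sample(letters, n, replace = TRUE)
}, simplify = FALSE)
names(cols) <- paste0("V", seq_along(cols))
data <- as.data.frame(cols, stringsAsFactors = FALSE)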

Upvotes: 0

Views: 89

Answers (2)

akrun

Reputation: 887851

You may also try:

set.seed(40)
df <- as.data.frame(matrix(sample(letters[1:3], 3 * 10, replace = TRUE), ncol = 10))
Filter(function(x) length(unique(x)) > 1, df)

Or

df[, colSums(df[-1, ] == df[-nrow(df), ]) != (nrow(df) - 1)] # still better than `apply`
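The idea behind that one-liner: df[-1, ] drops the first row and df[-nrow(df), ] drops the last, so the comparison lines up each row with its predecessor; a column is single-valued exactly when all nrow(df) - 1 consecutive pairs match. A minimal sketch on a single vector (note the trick assumes no NA values, since NA == NA yields NA):

x <- c("a", "a", "a", "b")
x[-1] == x[-length(x)]                        # each element vs. the previous one
# [1]  TRUE  TRUE FALSE
sum(x[-1] == x[-length(x)]) == length(x) - 1  # FALSE: x is not single-valued
# [1] FALSE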

Including these in the speed comparison as well (using @beginneR's sample data):

microbenchmark(
  new = {Filter(function(x) length(unique(x)) > 1, df)},
  new1 = {df[, colSums(df[-1, ] == df[-nrow(df), ]) != (nrow(df) - 1)]},
  apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
  lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
  unit = "relative",
  times = 100)
# Unit: relative
#   expr        min         lq    median         uq      max neval
#    new  1.0000000  1.0000000  1.000000  1.0000000 1.000000   100
#   new1  4.3741503  4.5144133  4.063634  3.9591345 1.713178   100
#  apply 23.9635826 24.0895813 21.361140 20.7650416 5.757233   100
# lapply  0.9991514  0.9979483  1.002005  0.9958308 1.002603   100

Upvotes: 1

talat

Reputation: 70336

You can use lapply instead:

data[, unlist(lapply(data, function(x) length(unique(x)) > 1L))]

Note that I added unlist to convert the resulting list into a logical vector of TRUE/FALSE values, which is then used for the subsetting.
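An equivalent form (my own variation, not part of the original answer) uses vapply, which enforces a single logical result per column and so removes the need for unlist:

# vapply guarantees each element is one logical, so the result is already a vector
data[, vapply(data, function(x) length(unique(x)) > 1L, logical(1))]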

Edit: here's a little benchmark:

library(microbenchmark)

a <- runif(1e4)
b <- 99
c <- sample(LETTERS, 1e4, TRUE)
df <- data.frame(a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c)

microbenchmark(
  apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
  lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
  unit = "relative",
  times = 100)

# Unit: relative
#   expr      min       lq   median       uq      max neval
#  apply 41.29383 40.06719 39.72256 39.16569 28.54078   100
# lapply  1.00000  1.00000  1.00000  1.00000  1.00000   100

Note that apply first converts the data.frame to a matrix and then performs the operation, which is less efficient (it also coerces all columns to a single common type). So in most cases where you're working with data.frames, you can (and should) avoid apply and use e.g. lapply instead.
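A quick illustration of that coercion, on a hypothetical two-column data frame (not from the answer):

df2 <- data.frame(num = 1:3, chr = c("a", "b", "c"), stringsAsFactors = FALSE)

apply(df2, 2, class)   # apply goes through as.matrix(), so both columns look like character
#         num         chr
# "character" "character"

sapply(df2, class)     # lapply/sapply work column-wise and keep the original types
#         num         chr
#   "integer" "character"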

Upvotes: 1
