Henk

Reputation: 3656

Filter out data.table columns based on summary statistics

I often need to filter out columns with a low variance from a data.table. The column names are not known in advance.

library(data.table)
dt = data.table(mtcars)

# flag columns whose standard deviation exceeds an arbitrary threshold of 1:
mask = dt[,lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]

# mask is a one-row data.table; the columns that are FALSE need to be removed
mask.t = t(mask)
mask.t = which(mask.t)
dt[,mask.t,with=FALSE] 

The approach above is clunky. Is there a more elegant way to keep only the columns of a data.table for which a column statistic evaluates to TRUE?

Upvotes: 3

Views: 741

Answers (1)

Henk

Reputation: 3656

These work:

dt[, .SD, .SDcols=unlist(mask)] 

dt[, .SD, .SDcols=which(unlist(mask))]

All together now:

variance.filter = function(df) {
  mask = df[, lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
  df[, .SD, .SDcols = unlist(mask)]
}
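A quick sanity check of the combined function (a sketch; the `sd > 1` threshold and the `mtcars` data are from the question):

```r
library(data.table)

variance.filter = function(df) {
  mask = df[, lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
  df[, .SD, .SDcols = unlist(mask)]
}

dt = data.table(mtcars)
filtered = variance.filter(dt)
names(filtered)
# drat, wt, vs, am and gear all have sd < 1 in mtcars, so they are dropped
```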

EDIT: in the current development version of data.table (1.12.9), .SDcols accepts a function that filters the columns, so this would work:

variance.filter = function(df) {
  df[ , .SD, .SDcols = function(x) sd(x, na.rm = TRUE) > 1]
}
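The function passed to .SDcols is applied to each column and should return a single TRUE/FALSE, so the same mechanism works for any per-column predicate. A small sketch (assuming a data.table version where the feature has been released; the toy table and predicates are illustrative):

```r
library(data.table)
dt = data.table(id = letters[1:3], x = c(-1.5, 0.5, 2.5), y = 1:3)

# keep only numeric columns
dt[, .SD, .SDcols = is.numeric]

# keep numeric columns whose mean exceeds 1
# (is.numeric() guards mean() against the character column)
dt[, .SD, .SDcols = function(x) is.numeric(x) && mean(x) > 1]
```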

Upvotes: 1
