Reputation: 3656
I often need to filter out columns with a low variance from a data.table. The column names are not known in advance.
dt = data.table(mtcars)
# calculate standard deviation with arbitrary max value of 1:
mask = dt[,lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
# The columns with the FALSE values in row 1 need to be removed
mask.t = t(mask)
mask.t = which(mask.t)
dt[,mask.t,with=FALSE]
The approach above is clunky. Is there a more elegant way to filter out columns out of a data.table for which the column statistic evaluates to TRUE?
Upvotes: 3
Views: 741
Reputation: 3656
These work:
dt[, .SD, .SDcols=unlist(mask)]
dt[, .SD, .SDcols=which(unlist(mask))]
All together now:
variance.filter = function(df) {
mask = df[,lapply(.SD, function(x) sd(x,na.rm = TRUE) > 1)]
df = df[, .SD, .SDcols = unlist(mask)]
}
data.table
(1.12.9), .SDcols
accepts a function filter for columns, so this would work:variance.filter = function(df) {
df[ , .SD, .SDcols = function(x) sd(x, na.rm = TRUE) > 1]
}
Upvotes: 1