Reputation: 338
I have a data.table object that contains a group column. I am trying to remove outliers from each of the groups, but I cannot come up with a nice solution. My data.table can be built using a simple script:
library(data.table)
col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)
dt <- data.table(id, group, col1, col2)
I've been trying to split the data.frame by the group variable, but that approach is too messy. How can I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?
Upvotes: 1
Views: 2703
Reputation: 1481
Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile:
dt_filt <- dt[,
  .SD[
    (col1 < quantile(col1, probs = 0.95)) &
    (col2 < quantile(col2, probs = 0.95))
  ],
  by = group
]
This splits the data by the group column, calculates the 95% quantile thresholds within each group, and then subsets the data to keep only rows where both col1 and col2 are lower than their thresholds.
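Since the question asks about the top n% in general, the same idea can be wrapped in a small helper. This is only a sketch: the function name remove_top_pct and its arguments cols, top_pct and by_col are made up for illustration, and the seed is there just to make the example reproducible. It uses .I to collect the row numbers to keep, so all original columns (including id) are preserved.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
dt <- data.table(
  id    = seq_len(30),
  group = sample(4, 30, replace = TRUE),
  col1  = rnorm(30, mean = 5, sd = 2),
  col2  = rnorm(30, mean = 5, sd = 2)
)

# Hypothetical helper (not part of data.table): keep only rows whose value
# in every column of `cols` is strictly below the (1 - top_pct) quantile
# computed within the row's group.
remove_top_pct <- function(dt, cols, top_pct = 0.05, by_col = "group") {
  keep_idx <- dt[,
    # .I holds the row numbers of the current group in the full table;
    # Reduce(`&`, ...) combines the per-column conditions row by row.
    .I[Reduce(`&`, lapply(.SD, function(x) x < quantile(x, probs = 1 - top_pct)))],
    by = by_col,
    .SDcols = cols
  ]$V1
  dt[sort(keep_idx)]  # restore the original row order
}

dt_filt <- remove_top_pct(dt, cols = c("col1", "col2"), top_pct = 0.05)
```

Note that because the comparison is strict (<), the maximum of each column within a group is always dropped, so very small groups can lose most of their rows.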
Upvotes: 6