Reputation: 799
I apologize if this is a repeat, I really don't know the proper terminology for what I'm trying to achieve.
I have a dataframe of drug lab results as follows:
╔══════╦════════╗ ║ drug ║ result ║ ╠══════╬════════╣ ║ A ║ 10 ║ ║ B ║ 150 ║ ║ B ║ 50 ║ ║ A ║ 14 ║ ║ C ║ 3 ║ ║ C ║ 7 ║ ╚══════╩════════╝
For each drug, I'm using dplyr to remove outliers (>4 SD's from the mean) using the following:
cleaned <- data %>% group_by(drug) %>% filter(abs(result-mean(result))/sd(result) < 4)
But now I would like to know how many outliers I'm removing per drug, so essentially I want to generate a dataframe that looks like the following:
╔══════╦═══════════╦══════════╦════════════╗ ║ drug ║ total (N) ║ outliers ║ % outliers ║ ╠══════╬═══════════╬══════════╬════════════╣ ║ A ║ 100 ║ 7 ║ 0.07 ║ ║ B ║ 200 ║ 45 ║ 0.225 ║ ║ C ║ 300 ║ 99 ║ 0.33 ║ ╚══════╩═══════════╩══════════╩════════════╝
What is the best way to go about doing this?
Upvotes: 3
Views: 2114
Reputation: 23574
Since there is no sample data, I decided to use the mtcars data set for demonstration. If I follow your approach, the following would be one way. Here, you want to find out the part of data you filtered out; you use setdiff()
to collect the data. Since am
is the group variable in this demonstration, use count()
and find how many outliers exist for each group (i.e., either 0 or 1 for am
). You further try to get a vector you need using select and unlist. Then, you use summarise()
and count how many data points exist for am
and add new columns using mutate()
.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(am) %>%
filter(abs(disp-mean(disp))/sd(disp) < 1) %>%
setdiff(mtcars, .) %>%
count(am) %>%
select(2) %>%
unlist-> out
#out
#n1 n2
#8 2
summarize(group_by(mtcars, am), total = n()) %>%
mutate(outliers = out, percent = outliers / total)
# am total outliers percent
# (dbl) (int) (int) (dbl)
#1 0 19 8 0.4210526
#2 1 13 2 0.1538462
Taking devmacrile's advice, I did the following. First, you group the data with a group variable. Then, you want to set up a flag column. Here, I created the column with mutate()
. You have TRUE and FALSE in the column. You count how many data points exist by am
and check
in count()
. You then reshape the outcome with spread()
in the tidyr
package. Now you calculate total data point for the 0 group and 1 group in am
. Once again, you group the data with am
, and finally you handle the percentage calculation and column renaming in transmute()
. I hope this sample will help you.
mtcars %>%
group_by(am) %>%
mutate(check = abs(disp-mean(disp))/sd(disp) < 1) %>%
count(am, check) %>%
spread(check, n) %>%
mutate(total = `FALSE` + `TRUE`) %>%
group_by(am) %>%
transmute(total, outliers = `FALSE`, percentage = `FALSE` / total)
# am total outliers percentage
# (dbl) (int) (int) (dbl)
#1 0 19 8 0.4210526
#2 1 13 2 0.1538462
Upvotes: 4
Reputation: 460
Instead of going right to filter(), I would create a flag field (i.e. 1 or 0) indicating whether or not the result was an outlier, and then pipe that into the appropriate summarization.
Upvotes: 2