Alexander David
Alexander David

Reputation: 799

R summary statistics from dataframe by group

I apologize if this is a repeat, I really don't know the proper terminology for what I'm trying to achieve.

I have a dataframe of drug lab results as follows:

╔══════╦════════╗
║ drug ║ result ║
╠══════╬════════╣
║ A    ║     10 ║
║ B    ║    150 ║
║ B    ║     50 ║
║ A    ║     14 ║
║ C    ║      3 ║
║ C    ║      7 ║
╚══════╩════════╝

For each drug, I'm using dplyr to remove outliers (>4 SD's from the mean) using the following:

cleaned <- data %>% group_by(drug) %>% filter(abs(result-mean(result))/sd(result) < 4)

But now I would like to know how many outliers I'm removing per drug, so essentially I want to generate a dataframe that looks like the following:

╔══════╦═══════════╦══════════╦════════════╗
║ drug ║ total (N) ║ outliers ║ % outliers ║
╠══════╬═══════════╬══════════╬════════════╣
║ A    ║       100 ║        7 ║ 0.07       ║
║ B    ║       200 ║       45 ║ 0.225      ║
║ C    ║       300 ║       99 ║ 0.33       ║
╚══════╩═══════════╩══════════╩════════════╝

What is the best way to go about doing this?

Upvotes: 3

Views: 2114

Answers (2)

jazzurro
jazzurro

Reputation: 23574

Since there is no sample data, I decided to use the mtcars data set for demonstration. If I follow your approach, the following would be one way. Here, you want to find out the part of data you filtered out; you use setdiff() to collect the data. Since am is the group variable in this demonstration, use count() and find how many outliers exist for each group (i.e., either 0 or 1 for am). You further try to get a vector you need using select and unlist. Then, you use summarise() and count how many data points exist for am and add new columns using mutate().

library(dplyr)
library(tidyr)

mtcars %>%
group_by(am) %>%
filter(abs(disp-mean(disp))/sd(disp) < 1) %>%
setdiff(mtcars, .) %>%
count(am) %>%
select(2) %>%
unlist-> out

#out
#n1 n2 
#8  2 

summarize(group_by(mtcars, am), total = n()) %>%
mutate(outliers = out, percent = outliers / total)

#     am total outliers   percent
#  (dbl) (int)    (int)     (dbl)
#1     0    19        8 0.4210526
#2     1    13        2 0.1538462

Taking devmacrile's advice, I did the following. First, you group the data with a group variable. Then, you want to set up a flag column. Here, I created the column with mutate(). You have TRUE and FALSE in the column. You count how many data points exist by am and check in count(). You then reshape the outcome with spread() in the tidyr package. Now you calculate total data point for the 0 group and 1 group in am. Once again, you group the data with am, and finally you handle the percentage calculation and column renaming in transmute(). I hope this sample will help you.

mtcars %>%
group_by(am) %>%
mutate(check = abs(disp-mean(disp))/sd(disp) < 1)  %>%
count(am, check) %>%
spread(check, n) %>%
mutate(total = `FALSE` + `TRUE`) %>%
group_by(am) %>%
transmute(total, outliers = `FALSE`, percentage = `FALSE` / total)

#     am total outliers percentage
#  (dbl) (int)    (int)      (dbl)
#1     0    19        8  0.4210526
#2     1    13        2  0.1538462

Upvotes: 4

devmacrile
devmacrile

Reputation: 460

Instead of going right to filter(), I would create a flag field (i.e. 1 or 0) indicating whether or not the result was an outlier, and then pipe that into the appropriate summarization.

Upvotes: 2

Related Questions