Reputation: 99
I have a data set in which participants were assigned to different groups and completed the same tests. I know I can use the aggregate function to get the mean and sd, but I cannot figure out how to find the outliers within each group.
df <- read.table(header=TRUE, sep=",", text="id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
I like the format of the code below, but I do not know how to change it to identify outliers for each group on each test.
Also, I want an outlier to be any value more than 2 standard deviations from the mean rather than 3. Can I set that within this code too?
##to get outliers on test1 if groups were combined
badexample <- boxplot(df$test1, plot=F)$out
which(df$test1 %in% badexample)
This would work if I wanted the outliers of both groups together on test1 but I want to separate by group.
Output should contain:
outliers for group 0 on test1
outliers for group 0 on test2
outliers for group 1 on test1
outliers for group 1 on test2
Upvotes: 2
Views: 280
Reputation: 11140
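Here's a way with dplyr. Group by group first so that the mean and sd are computed within each group rather than over all rows.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate_at(
    vars(starts_with("test")),
    list(outlier = ~(abs(. - mean(.)) > 2*sd(.)))
  ) %>%
  ungroup() %>%
  as.data.frame()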
id group test1 test2 test1_outlier test2_outlier
1 1 0 57 82 FALSE FALSE
2 2 0 77 80 FALSE FALSE
3 3 0 67 90 FALSE FALSE
4 4 0 15 70 FALSE FALSE
5 5 0 58 72 FALSE FALSE
6 6 1 18 44 FALSE FALSE
7 7 1 44 44 FALSE FALSE
8 8 1 18 46 FALSE FALSE
9 9 1 20 44 FALSE FALSE
10 10 1 14 38 FALSE FALSE
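Every flag is FALSE here for a structural reason: when sd() uses the n - 1 denominator, no value in a group of n observations can lie more than (n - 1)/sqrt(n) standard deviations from that group's mean, which is about 1.79 for n = 5. The base R sketch below checks this on the question's data; `group_z` is a hypothetical helper name, not part of dplyr or base R.

```r
# Same data as in the question, read with an explicit comma separator.
df <- read.table(header = TRUE, sep = ",", text = "id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")

# Hypothetical helper: absolute z-score of each value within its own group.
group_z <- function(x, g) {
  ave(as.numeric(x), g, FUN = function(v) abs(v - mean(v)) / sd(v))
}

z1 <- group_z(df$test1, df$group)
z2 <- group_z(df$test2, df$group)
print(round(cbind(id = df$id, group = df$group, z1 = z1, z2 = z2), 2))

# Largest |z| that any value can reach in a group of size n.
n <- 5
bound <- (n - 1) / sqrt(n)
print(bound)
```

With groups this small, a 2 SD cutoff cannot flag anything, so a lower cutoff or larger groups would be needed before the rule becomes informative.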
Upvotes: 0
Reputation: 76402
You can write a function to compute the outliers and then call it with ave.
outlier <- function(x, SD = 2){
  mu <- mean(x)
  sigma <- sd(x)
  # TRUE where x lies more than SD standard deviations from the mean
  out <- x < mu - SD*sigma | x > mu + SD*sigma
  out
}
with(df, ave(test1, group, FUN = outlier))
# [1] 0 0 0 0 0 0 0 0 0 0
with(df, ave(test2, group, FUN = outlier))
# [1] 0 0 0 0 0 0 0 0 0 0
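To have new columns in df with these results, assign in the usual way. (ave returns 0/1 here rather than FALSE/TRUE; wrap the call in as.logical() if you want logical columns.)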
df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
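If the goal is the listing the question asks for (outlier ids per group, per test), one possible base R sketch is below. The helper names `is_outlier` and `list_outliers` and the argument `k` are made up for illustration; the 1.5 cutoff is used only to show a non-empty result and is not something the question asked for.

```r
# Same data as in the question, read with an explicit comma separator.
df <- read.table(header = TRUE, sep = ",", text = "id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")

# Hypothetical helper: TRUE where a value lies more than k SDs from the mean.
is_outlier <- function(x, k = 2) abs(x - mean(x)) > k * sd(x)

# Hypothetical helper: ids of flagged rows, per group, for each test column.
list_outliers <- function(data, tests = c("test1", "test2"), k = 2) {
  by_group <- split(data, data$group)
  lapply(by_group, function(g) {
    out <- lapply(tests, function(tst) g$id[is_outlier(g[[tst]], k)])
    setNames(out, tests)
  })
}

res2  <- list_outliers(df)           # the 2 SD cutoff from the question
res15 <- list_outliers(df, k = 1.5)  # looser cutoff, for illustration only
str(res2)
str(res15)
```

The result is a nested list indexed first by group ("0", "1") and then by test, so `res15[["1"]]$test1` gives the ids flagged in group 1 on test1.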
Upvotes: 1
Reputation: 84519
An option, using data.table:
library(data.table)
df <- read.table(header=T, sep=",", text="id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
DT <- as.data.table(df)
DT[, `:=`(mean1 = mean(test1), sd1 = sd(test1), mean2 = mean(test2), sd2 = sd(test2)), by = "group"]
DT[, `:=`(outlier1 = abs(test1-mean1)>2*sd1, outlier2 = abs(test2-mean2)>2*sd2)]
DT
# id group test1 test2 mean1 sd1 mean2 sd2 outlier1 outlier2
# 1: 1 0 57 82 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 2: 2 0 77 80 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 3: 3 0 67 90 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 4: 4 0 15 70 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 5: 5 0 58 72 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 6: 6 1 18 44 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 7: 7 1 44 44 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 8: 8 1 18 46 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 9: 9 1 20 44 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 10: 10 1 14 38 22.8 12.04990 43.2 3.033150 FALSE FALSE
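For comparison, the boxplot()-style code from the question can also be run within each group. This is only a sketch: it uses the boxplot rule (1.5 times the hinge spread, via boxplot.stats from grDevices) rather than the 2 SD rule, and `boxplot_outlier_ids` is a hypothetical helper name.

```r
# Same data as above.
df <- read.table(header = TRUE, sep = ",", text = "id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")

# Hypothetical helper: row ids flagged by boxplot.stats() within each group.
boxplot_outlier_ids <- function(values, ids, groups) {
  idx <- split(seq_along(values), groups)
  lapply(idx, function(i) {
    flagged <- values[i] %in% boxplot.stats(values[i])$out
    ids[i][flagged]
  })
}

out1 <- boxplot_outlier_ids(df$test1, df$id, df$group)
out2 <- boxplot_outlier_ids(df$test2, df$id, df$group)
print(out1)
print(out2)
```

Because the boxplot rule is based on quartile-like hinges rather than the standard deviation, it can flag values in five-row groups where the 2 SD rule flags none.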
Upvotes: 0