crich

Reputation: 99

Identifying outliers in different groups

I have a data set where participants were assigned to different groups and completed the same tests. I know I can use the aggregate function to identify the mean and sd but I cannot figure out how to find the outliers in these groups.

df <- read.table(header=TRUE, sep=",", text="id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")

I like the format of the code below, but I do not know how to change it to identify outliers for each group on each test.

Also, I want an outlier to be any value more than 2 standard deviations from the mean, rather than 3. Can I set that within this code too?

##to get outliers on test1 if groups were combined
badexample <- boxplot(df$test1, plot=F)$out
which(df$test1 %in% badexample)

This would work if I wanted the outliers of both groups together on test1 but I want to separate by group.

The output should contain the outliers for group 0 on test1, group 0 on test2, group 1 on test1, and group 1 on test2.

Upvotes: 2

Views: 280

Answers (3)

Shree

Reputation: 11140

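Here's a way with dplyr. Grouping by group first makes mean() and sd() run within each group rather than over all ten rows:

library(dplyr)

df %>% 
  group_by(group) %>%   # compute mean and sd within each group
  mutate_at(
    vars(starts_with("test")),
    list(outlier = ~(abs(. - mean(.)) > 2*sd(.)))
  ) %>%
  as.data.frame()       # drop the grouping and print as a plain data frame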

   id group test1 test2 test1_outlier test2_outlier
1   1     0    57    82         FALSE         FALSE
2   2     0    77    80         FALSE         FALSE
3   3     0    67    90         FALSE         FALSE
4   4     0    15    70         FALSE         FALSE
5   5     0    58    72         FALSE         FALSE
6   6     1    18    44         FALSE         FALSE
7   7     1    44    44         FALSE         FALSE
8   8     1    18    46         FALSE         FALSE
9   9     1    20    44         FALSE         FALSE
10 10     1    14    38         FALSE         FALSE
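
Applied within each group of five, as the question asks, this rule can never flag anything: when sd() (the n - 1 denominator) is used, the largest possible |z| in a sample of n values is (n - 1)/sqrt(n), which is about 1.79 for n = 5. A minimal base R sketch of that check follows; the helper name max_abs_z is made up for illustration.

```r
# Largest |z| attainable in a sample of size n when sd() (n - 1 denominator) is used
max_abs_z <- function(n) (n - 1) / sqrt(n)

max_abs_z(5)   # about 1.79, below the 2 SD cutoff
max_abs_z(10)  # about 2.85, so a 2 SD cutoff can fire with 10 values

# Group 0 on test1 from the question: even 15 is under 2 SDs from the group mean
x <- c(57, 77, 67, 15, 58)
z <- (x - mean(x)) / sd(x)
max(abs(z))    # about 1.68
```

With groups this small, a lower cutoff or a robust rule (for example one based on the median) may be worth considering.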

Upvotes: 0

Rui Barradas

Reputation: 76402

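You can write a function that flags the outliers and then apply it within each group using ave. Because ave returns a vector of the same type as its input, the logical flags come back as 0/1.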

outlier <- function(x, SD = 2){
  mu <- mean(x)
  sigma <- sd(x)
  out <- x < mu - SD*sigma | x > mu + SD*sigma
  out
}

with(df, ave(test1, group, FUN = outlier))
# [1] 0 0 0 0 0 0 0 0 0 0

with(df, ave(test2, group, FUN = outlier))
# [1] 0 0 0 0 0 0 0 0 0 0

To have new columns in df with these results, assign in the usual way.

df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
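
If the goal is the layout asked for in the question (one set of outliers per group and per test), the same idea can return the ids of the flagged rows instead of 0/1 columns. This is only a sketch: the data frame is rebuilt inline so the block runs on its own, outlier() is restated in an equivalent one-line form, and outlier_ids() is a hypothetical helper name. The cutoff of 1.5 is used purely so the result is non-empty on this data; with SD = 2 every entry is empty.

```r
df <- data.frame(
  id    = 1:10,
  group = rep(c(0, 1), each = 5),
  test1 = c(57, 77, 67, 15, 58, 18, 44, 18, 20, 14),
  test2 = c(82, 80, 90, 70, 72, 44, 44, 46, 44, 38)
)

# TRUE where a value lies more than SD standard deviations from the mean
outlier <- function(x, SD = 2) abs(x - mean(x)) > SD * sd(x)

# Hypothetical helper: ids of outlying rows for every group/test combination
outlier_ids <- function(data, tests, SD = 2) {
  lapply(split(data, data$group), function(g) {
    lapply(g[tests], function(x) g$id[outlier(x, SD)])
  })
}

res <- outlier_ids(df, c("test1", "test2"), SD = 1.5)
res[["0"]]$test1  # id 4
res[["0"]]$test2  # empty
res[["1"]]$test1  # id 7
res[["1"]]$test2  # id 10
```

The result is a nested list indexed first by group and then by test, so new test columns only need to be added to the tests argument.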

Upvotes: 1

Stéphane Laurent

Reputation: 84519

An option, using data.table:

library(data.table)

df <- read.table(header=T, sep=",", text="id, group, test1, test2
1, 0, 57, 82
               2, 0, 77, 80
               3, 0, 67, 90
               4, 0, 15, 70
               5, 0, 58, 72
               6, 1, 18, 44
               7, 1, 44, 44
               8, 1, 18, 46
               9, 1, 20, 44
               10, 1, 14, 38")

DT <- as.data.table(df)
DT[, `:=`(mean1 = mean(test1), sd1 = sd(test1), mean2 = mean(test2), sd2 = sd(test2)), by = "group"]
DT[, `:=`(outlier1 = abs(test1-mean1)>2*sd1, outlier2 = abs(test2-mean2)>2*sd2)]
DT
#     id group test1 test2 mean1      sd1 mean2      sd2 outlier1 outlier2
#  1:  1     0    57    82  54.8 23.66854  78.8 8.074652    FALSE    FALSE
#  2:  2     0    77    80  54.8 23.66854  78.8 8.074652    FALSE    FALSE
#  3:  3     0    67    90  54.8 23.66854  78.8 8.074652    FALSE    FALSE
#  4:  4     0    15    70  54.8 23.66854  78.8 8.074652    FALSE    FALSE
#  5:  5     0    58    72  54.8 23.66854  78.8 8.074652    FALSE    FALSE
#  6:  6     1    18    44  22.8 12.04990  43.2 3.033150    FALSE    FALSE
#  7:  7     1    44    44  22.8 12.04990  43.2 3.033150    FALSE    FALSE
#  8:  8     1    18    46  22.8 12.04990  43.2 3.033150    FALSE    FALSE
#  9:  9     1    20    44  22.8 12.04990  43.2 3.033150    FALSE    FALSE
# 10: 10     1    14    38  22.8 12.04990  43.2 3.033150    FALSE    FALSE
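
With more test columns, writing out mean1, sd1, mean2, sd2 by hand gets tedious. A possible variant, sketched here under the assumption that only the flags are needed and not the intermediate means and standard deviations, uses .SDcols to apply one function to every test column within each group. The names tests and out_cols are illustrative, and the table is rebuilt inline so the block runs on its own.

```r
library(data.table)

DT <- data.table(
  id    = 1:10,
  group = rep(c(0, 1), each = 5),
  test1 = c(57, 77, 67, 15, 58, 18, 44, 18, 20, 14),
  test2 = c(82, 80, 90, 70, 72, 44, 44, 46, 44, 38)
)

tests    <- c("test1", "test2")          # columns to check
out_cols <- paste0("outlier_", tests)    # names for the new flag columns

# Within each group, flag values more than 2 SDs from the group mean
DT[, (out_cols) := lapply(.SD, function(x) abs(x - mean(x)) > 2 * sd(x)),
   by = group, .SDcols = tests]
DT
```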

Upvotes: 0
