Show outliers in an efficient manner using ggplot

Question

My actual data (and aim) are different, but for reproducibility I used the Titanic dataset. My aim is to create a plot of the age outliers (ages more than 1 SD from the group mean) per class and sex.

So the first thing I did was calculate the SD values and ranges:

library(dplyr)
library(ggplot2)

#Load titanic set
titanic <- read.csv("titanic_total.csv")
group <- group_by(titanic, Pclass, Sex)

#Create outlier ranges
summarise <- summarise(group, mean=mean(Age), sd=sd(Age))
summarise <- as.data.frame(summarise)
summarise$outlier_max <- summarise$mean + summarise$sd
summarise$outlier_min <- summarise$mean - summarise$sd

#Create a key
summarise$key <- paste0(summarise$Pclass, summarise$Sex)

#Create a key for the base set
titanic$key <- paste0(titanic$Pclass, titanic$Sex)

total_data <- left_join(titanic, summarise, by = "key")
total_data$outlier <- 0

Next, using a loop, I determine whether each age is inside or outside the range:

for (row in 1:nrow(total_data)){
 if((total_data$Age[row]) > (total_data$outlier_max[row])){
  total_data$outlier[row] <- 1
 } else if ((total_data$Age[row]) < (total_data$outlier_min[row])){
  total_data$outlier[row] <- 1
 } else {
  total_data$outlier[row] <- 0
 }
}

Do some data cleaning ...

total_data$Pclass.x <- as.factor(total_data$Pclass.x)
total_data$outlier <- as.factor(total_data$outlier)

Now this code gives me the plot I am looking for.

ggplot(total_data, aes(x = Age, y = Pclass.x, colour = outlier)) + geom_point() +
 facet_grid(. ~Sex.x)

However, this does not really seem like the easiest way to solve the problem. Any thoughts on how I can apply best practices to make this more efficient?

cdermont · Accepted Answer

One way to shorten your code and make it less repetitive is to put it all into a single pipeline using the pipe operator. Instead of creating a summary with the values and then re-joining it with the data, you can do everything within one mutate step:

titanic %>% 
  mutate(Pclass = as.factor(Pclass)) %>% 
  group_by(Pclass, Sex) %>% 
  mutate(Age.mean = mean(Age), 
         Age.sd = sd(Age), 
         outlier.max = Age.mean + Age.sd, 
         outlier.min = Age.mean - Age.sd, 
         outlier = as.factor(ifelse(Age > outlier.max, 1, 
                                    ifelse(Age < outlier.min, 1, 0)))) %>% 
  ggplot() +
    geom_point(aes(Age, Pclass, colour = outlier)) +
    facet_grid(.~Sex)

Pclass is converted to a factor up front because it is a grouping variable. The remaining steps then run on the original data frame instead of creating two new ones. Note that the original dataset is not modified. If you want to keep the new columns, assign the result of the data steps to titanic or another data frame and run the ggplot part as a separate step; otherwise you would assign the plot object, not the data, to that name.
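As a minimal sketch of that two-step variant: the data frame toy below is hypothetical stand-in data (not the real Titanic file), and the names flagged and p are assumptions chosen for illustration. The data steps are stored first, and the plot is then built from the stored result.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the Titanic columns used above
toy <- data.frame(
  Pclass = rep(c(1, 2), each = 4),
  Sex    = rep(c("female", "male"), each = 4),
  Age    = c(20, 22, 24, 60, 2, 30, 31, 33)
)

# Step 1: keep the flagged data as its own object
flagged <- toy %>%
  mutate(Pclass = as.factor(Pclass)) %>%
  group_by(Pclass, Sex) %>%
  mutate(outlier = as.factor(ifelse(Age > mean(Age) + sd(Age), 1,
                                    ifelse(Age < mean(Age) - sd(Age), 1, 0)))) %>%
  ungroup()

# Step 2: build the plot from the stored data
p <- ggplot(flagged) +
  geom_point(aes(Age, Pclass, colour = outlier)) +
  facet_grid(. ~ Sex)

print(p)
```

Keeping flagged separate from p means the outlier column can be inspected or reused without re-running the pipeline.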

One way to identify the outliers is with ifelse. Alternatively, dplyr offers the handy between function. It returns TRUE for values inside the range (bounds included), so it has to be negated to flag outliers. In dplyr versions where between requires scalar bounds, you also need to add rowwise after creating the min and max thresholds:

...
rowwise() %>% 
    mutate(outlier = as.factor(as.numeric(!between(Age, outlier.min, outlier.max)))) %>% ...

Additionally, you could shorten the code even further, depending on which variables you want to keep and in what form:

titanic %>% 
    group_by(Pclass, Sex) %>% 
    mutate(outlier = as.factor(ifelse(Age > (mean(Age) + sd(Age)), 1, 
                                      ifelse(Age < (mean(Age) - sd(Age)), 1, 0)))) %>% 
    ggplot() +
    geom_point(aes(Age, as.factor(Pclass), colour = outlier)) +
    facet_grid(.~Sex)
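The nested ifelse can also be written as a single vectorised comparison, since being above mean + sd or below mean - sd is the same as abs(Age - mean) > sd. The sketch below is one possible variant, not the approach from the answer above: it assumes Age may contain missing values (hence na.rm = TRUE, which the original code does not use) and stores the flag as a logical column. The data frame toy is hypothetical stand-in data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; one Age is missing on purpose
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  Sex    = c(rep("female", 5), rep("male", 4)),
  Age    = c(20, 22, 24, 60, NA, 2, 30, 31, 33)
)

flagged <- toy %>%
  group_by(Pclass, Sex) %>%
  # TRUE when the age is more than one SD away from the group mean;
  # rows with a missing Age get NA rather than a flag
  mutate(outlier = abs(Age - mean(Age, na.rm = TRUE)) > sd(Age, na.rm = TRUE)) %>%
  ungroup()

p <- ggplot(flagged) +
  geom_point(aes(Age, as.factor(Pclass), colour = outlier)) +
  facet_grid(. ~ Sex)
```

Without na.rm = TRUE, a single missing Age makes mean and sd return NA for the whole group, so every row in that group would end up unflagged.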
