Tom

Reputation: 2341

Excluding an outlier from the regression line fitted through a scatterplot, without removing it from the plot

I have the following data, which I plot with the ggplot code below:

data <- structure(list(country_mean_rep = structure(c(73.6995708154506, 
93.5501285347044, 85.1529051987768, 91.1017369727047, 79.5562130177515, 
84.6751054852321, 89.8, 86.8826405867971, 94.2247191011236, 70.2321428571429, 
88.4107142857143), label = "label", format.stata = "%9.2f"), 
    country_mean_crime = c(0.0944206008583691, 0.0565552699228792, 
    0.0336391437308868, 0.205955334987593, 0.130177514792899, 
    0.282700421940928, 0.220512820512821, 0.415647921760391, 
    0.387640449438202, 0.200892857142857, 0.292207792207792), 
    country_name = structure(c(1L, 2L, 3L, 4L, 5L, 7L, 11L, 12L, 
    14L, 16L, 20L), .Label = c("Albania", "Armenia", "Azerbaijan", 
    "Belarus", "Bosnia and Herzegovina", "Brazil", "Bulgaria", 
    "Cambodia", "Chile", "CostaRica", "Croatia", "Czech", "Ecuador", 
    "Estonia", "FYROM", "Georgia", "Germany", "Greece", "Guyana", 
    "Hungary", "Ireland", "Kazakhstan", "Kenya", "Kyrgyzstan", 
    "Latvia", "Lithuania", "Malawi", "Mali", "Moldova", "Philippines", 
    "Poland", "Portugal", "Romania", "Russia", "Senegal", "Serbia&Montenegro", 
    "Slovakia", "Slovenia", "South Africa", "South Korea", "Spain", 
    "SriLanka", "Tajikistan", "Turkey", "Ukraine", "Uzbekistan", 
    "Vietnam"), class = "factor")), row.names = c(NA, -11L), class = c("data.table", 
"data.frame"))

# The code I would like to run on it:

ggplot(data, aes(x=country_mean_rep, y=country_mean_crime)) + 
  geom_point() + 
  geom_smooth(aes(colour="linear", fill="linear"), 
              method="lm", 
              formula=y ~ x) + 
  geom_smooth(aes(colour="quadratic", fill="quadratic"), 
              method="lm", 
              formula=y ~ x + I(x^2)) + 
  geom_smooth(aes(colour="cubic", fill="cubic"), 
              method="lm", 
              formula=y ~ x + I(x^2) + I(x^3)) + 
  labs(colour="Functional Form", fill="Functional Form") +
  geom_text(aes(label=country_name), nudge_y=0.02) +
  theme_bw()


Now let's say that the Czech Republic is an outlier that I want to exclude from the fits I am doing (especially the linear one). I understand there is nothing wrong with the Czech Republic in this example; I need to know how to do this for a genuine outlier in my actual data.

Is there some way of excluding it only from the fit, while keeping the dot in the plot?

Upvotes: 4

Views: 1297

Answers (2)

Rui Barradas

Reputation: 76402

Here is a way.
Start the plot with the subset of the data that excludes "Czech", and use the entire data set only in the data argument of geom_point. This way the "Czech" point is plotted but excluded from the fits.

In fact, it is excluded from every other layer as well. So if you want the "Czech" label, you also have to pass data = data (the full data set) to geom_text.

library(ggplot2)

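# The default data excludes "Czech", so every layer that inherits it (the fits, the labels) ignores that row.
ggplot(data = subset(data, country_name != "Czech"), aes(x=country_mean_rep, y=country_mean_crime)) + 
  # Only geom_point gets the full data set, so the "Czech" point is still drawn.
  geom_point(data = data) +
  geom_smooth(aes(colour="linear", fill="linear"), 
              method="lm", 
              formula=y ~ x) + 
  geom_smooth(aes(colour="quadratic", fill="quadratic"), 
              method="lm", 
              formula=y ~ x + I(x^2)) + 
  geom_smooth(aes(colour="cubic", fill="cubic"), 
              method="lm", 
              formula=y ~ x + I(x^2) + I(x^3)) + 
  labs(colour="Functional Form", fill="Functional Form") +
  # Add data = data here as well if the "Czech" label should be shown.
  geom_text(aes(label=country_name), nudge_y=0.02) +
  theme_bw()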
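
To check that dropping the row from the layer data really changes the fit, the same subset can be passed to lm directly and the coefficients compared. The sketch below is only an illustration: `toy` is a made-up data frame standing in for the question's `data` (its values are invented), and `fit_all` / `fit_excl` are hypothetical names.

```r
# Made-up stand-in for the question's data; "Czech" is deliberately far off the trend.
toy <- data.frame(
  country_name       = c("A", "B", "C", "D", "Czech"),
  country_mean_rep   = c(70, 75, 80, 85, 90),
  country_mean_crime = c(0.10, 0.16, 0.19, 0.25, 0.90)
)

# Fit on all rows, and on the rows that the smoothing layers would see.
fit_all  <- lm(country_mean_crime ~ country_mean_rep, data = toy)
fit_excl <- lm(country_mean_crime ~ country_mean_rep,
               data = subset(toy, country_name != "Czech"))

# Compare the number of observations used and the estimated coefficients.
nobs(fit_all)
nobs(fit_excl)
coef(fit_all)
coef(fit_excl)
```

With method = "lm" and formula = y ~ x, geom_smooth fits the same kind of model to the data its layer receives, so the line drawn from the subsetted data should correspond to `fit_excl` rather than `fit_all`.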


Upvotes: 2

Cole

Reputation: 11255

One way to do it is to give different layers different data:

ggplot(subset(data, country_name != 'Czech'), aes(x=country_mean_rep, y=country_mean_crime)) + 
  geom_smooth(aes(colour="linear", fill="linear"), 
              method="lm", 
              formula=y ~ x) + 
  geom_smooth(aes(colour="quadratic", fill="quadratic"), 
              method="lm", 
              formula=y ~ x + I(x^2)) + 
  geom_smooth(aes(colour="cubic", fill="cubic"), 
              method="lm", 
              formula=y ~ x + I(x^2) + I(x^3)) + 
  labs(colour="Functional Form", fill="Functional Form") +
  geom_point(data = data, inherit.aes = FALSE, aes(x = country_mean_rep, y = country_mean_crime)) +
  geom_text(data = data, aes(label=country_name, x = country_mean_rep, y = country_mean_crime), inherit.aes = FALSE, nudge_y=0.02) +
  theme_bw()

In this case, the three lm fits use the subsetted data, whereas geom_point and geom_text are given the full data set and their own aesthetics (inherit.aes = FALSE), so the "Czech" point and its label are still drawn.
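
A variation on the same idea, sketched here as an assumption rather than as part of either answer: keep the full data as the plot default and subset only inside the smoothing layers. The `data` argument of a ggplot2 layer also accepts a function, which is called with the plot's default data. The `toy` data frame (invented values) and the helper `drop_outlier` are hypothetical names used only for this illustration.

```r
library(ggplot2)

# Made-up stand-in for the question's data.
toy <- data.frame(
  country_name       = c("A", "B", "C", "D", "Czech"),
  country_mean_rep   = c(70, 75, 80, 85, 90),
  country_mean_crime = c(0.10, 0.16, 0.19, 0.25, 0.90)
)

# Hypothetical helper: removes the outlier from whatever data the layer receives.
drop_outlier <- function(d) subset(d, country_name != "Czech")

p <- ggplot(toy, aes(x = country_mean_rep, y = country_mean_crime)) +
  geom_point() +                                          # full data: point is drawn
  geom_text(aes(label = country_name), nudge_y = 0.02) +  # full data: label is drawn
  geom_smooth(data = drop_outlier,                        # fit sees only the subset
              method = "lm", formula = y ~ x) +
  theme_bw()

p
```

The trade-off is that each geom_smooth call needs the `data = drop_outlier` argument, but the point and label layers need no overrides at all.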

Upvotes: 2
