jwillis0720

Reputation: 4477

How to remove an outlier in a faceted ggplot2 point plot that squashes the rest of the data

I produced a faceted plot that I'm very satisfied with except for one issue. On a couple of the panels, one or two outliers completely ruin the graph. I could use the ylim() function, but I'm using facet_grid(scales="free"), so each panel has its own limits. Here is my code and the graph it produced.

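library(reshape2)  # melt() is assumed to come from reshape2
library(ggplot2)
melted_df <- melt(df, id = 'ca_rmsd')
# The + must end a line, not start one, or R stops parsing the plot early
ggplot(melted_df, aes(ca_rmsd, value)) + geom_point() +
       facet_grid(variable ~ ., scales = "free")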

produces
(source: willisjr at structbio.vanderbilt.edu)

As you can see, the top plot has a data point far outside the range of the others, which compresses the rest of the data.

Upvotes: 4

Views: 6698

Answers (1)

Adrian

Reputation: 3308

Here's a possibility:

library(ggplot2)
n <- 1000
df <- data.frame(x=rnorm(n), y=rnorm(n),
                 label=sample(letters[1:4], size=n,
                   replace=TRUE))
df$y[1:50] <- 50  # Add some outliers

## Similar to your plot
ggplot(df, aes(x, y)) + geom_point() + facet_wrap(~ label)

library(plyr)
df.quantiles <- ddply(df, "label", summarise,
                      q99=quantile(y, probs=0.99),
                      q90=quantile(y, probs=0.90))
df <- merge(df, df.quantiles, by="label", all.x=TRUE)

## More or less what you want?
ggplot(df[df$y < df$q99, ],
       aes(x, y)) + geom_point() + facet_wrap(~ label)

This assumes the outliers are all on the high side, but you could easily extend it to trim the low side as well.
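A minimal sketch of that two-sided extension, on toy data of the same shape as above. The column names q01 and q99, the 1%/99% cutoffs, and the use of base R's ave() in place of ddply/merge are all choices made for this illustration, not part of the original answer:

```r
library(ggplot2)

set.seed(1)  # reproducible toy data
n <- 1000
df <- data.frame(x = rnorm(n), y = rnorm(n),
                 label = sample(letters[1:4], size = n, replace = TRUE))
df$y[1:25]  <- 50    # some high outliers
df$y[26:50] <- -50   # some low outliers

# Per-facet lower and upper cutoffs (hypothetical column names q01 / q99).
# ave() recycles each group's quantile back onto that group's rows.
df$q01 <- ave(df$y, df$label, FUN = function(v) quantile(v, probs = 0.01))
df$q99 <- ave(df$y, df$label, FUN = function(v) quantile(v, probs = 0.99))

# Keep only the points strictly inside both cutoffs
trimmed <- df[df$y > df$q01 & df$y < df$q99, ]

# Build the plot; print(p) draws it
p <- ggplot(trimmed, aes(x, y)) + geom_point() + facet_wrap(~ label)
```

Because the comparisons are strict, each facet's minimum and maximum are always dropped, so the injected values at 50 and -50 never survive the filter.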

You could try something slightly more sophisticated, maybe

df[df$y < df$q99 | (df$q99 / df$q90) < some.ratio, ]

where you choose some.ratio so that you only throw out the largest 1% of Ys when they are deemed to be outliers, rather than all the time.
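To make the ratio test concrete, here is a rough sketch under stated assumptions: the threshold some.ratio = 2 is an arbitrary value picked for this toy data, the y values are shifted by 10 so that both quantiles are positive (the ratio is not meaningful when q90 is near zero or negative), and ave() stands in for the ddply/merge step:

```r
set.seed(1)  # reproducible toy data
n <- 1000
df <- data.frame(x = rnorm(n),
                 y = rnorm(n) + 10,  # shifted so q90 and q99 stay positive
                 label = sample(letters[1:4], size = n, replace = TRUE))
df$y[df$label == "a"][1:10] <- 500   # outliers in facet "a" only

# Per-facet 90th and 99th percentiles, recycled onto each row
df$q90 <- ave(df$y, df$label, FUN = function(v) quantile(v, probs = 0.90))
df$q99 <- ave(df$y, df$label, FUN = function(v) quantile(v, probs = 0.99))

some.ratio <- 2  # hypothetical threshold; tune it for your own data

# Trim the top 1% only in facets where q99 is far above q90;
# facets with a small q99/q90 ratio are kept whole.
kept <- df[df$y < df$q99 | (df$q99 / df$q90) < some.ratio, ]
```

Here only facet "a" has a large q99/q90 ratio, so only its extreme points are dropped; kept can then be passed to the same ggplot call as before.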

Hope that helps.

Upvotes: 2
