Reputation: 915
I'm looking for a function that would allow for subsetting a data frame based on the density of bivariate observations. For example:
ggplot(iris, aes(x = Petal.Length, y = Sepal.Width, color = Species)) +
  stat_density2d(geom = 'polygon', aes(fill = ..level..), n = 8) +
  geom_point()
Here, I would like to display only the points that are outliers based on the density of points within a Species (i.e., only show the 3 points from setosa and 4 points from virginica that lie outside the contours).
Upvotes: 0
Views: 493
Reputation: 1311
My methodology is a little convoluted, so bear with me; I'll explain below:
library(data.table)
# Keep only the columns we plan to plot
dt <- as.data.table(iris)[, .(Petal.Length, Sepal.Width, Species)]
# Row id, so each flower can be told apart after melting
dt[, sample := .I]
dt <- melt(dt, id.vars = c("Species", "sample"))
# Mean and SD of each measurement within each species
dt[, c("meanval", "sdval") := .(mean(value), sd(value)), by = .(Species, variable)]
# Flag values more than 2 SD away from their species mean
dt[, outlier := abs((value - meanval) / sdval) > 2]
# A flower is an outlier if any of its measurements is flagged
dt[, outlier := any(outlier), by = sample]
dt <- dcast(
  dt[, .(Species, variable, value, outlier, sample)],
  sample + outlier + Species ~ variable,
  value.var = "value"
)
First we assign dt as the data set and keep only the columns we plan to plot. Next, we add a row-id column (sample) so that each flower can still be told apart later; this matters for this dataset because several flowers share identical measurements. Then we melt() the dataset for convenience, so both measurements can be processed together. Then, for each species, we calculate the mean and standard deviation of each variable. This lets us flag outliers on the next line (change > 2 to adjust the number of standard deviations used).
Then, for each flower, we check whether it is an outlier in any of the chosen metrics (here, Petal.Length and Sepal.Width). If it is, the whole flower is labelled an outlier. Finally, we dcast() the table back into its original form, only now there is an outlier column showing whether the flower was an outlier in any of our metrics.
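As a cross-check, the same rule can be sketched in base R without reshaping. This is only an illustrative sketch, not part of the code above: the helper z_within() and the names flag and outliers are made up for the example, and the cutoff of 2 SD is the same arbitrary choice as before.

```r
# Hypothetical helper: absolute z-score of x within each level of g
z_within <- function(x, g) {
  abs((x - ave(x, g)) / ave(x, g, FUN = sd))
}

# A flower is flagged if either measurement is more than 2 SD
# from the mean of its own species
flag <- z_within(iris$Petal.Length, iris$Species) > 2 |
  z_within(iris$Sepal.Width, iris$Species) > 2

# Keep only the flagged flowers and the plotted columns
outliers <- iris[flag, c("Petal.Length", "Sepal.Width", "Species")]
```

Passing outliers as the data argument of geom_point() would then draw only the flagged flowers. Note that this treats each axis separately, so it approximates rather than reproduces the 2D density contours.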
I won't go into plotting these, as you can figure out how you want to do that on your own, but this should give a general gist of the direction to go. Hope that helps.
Upvotes: 1
Reputation: 657
This is a rather hack-y solution, but you could write a function to extract the points outside of the contour plot and return a data frame with just those points:
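library(ggplot2)
library(dplyr)  # filter(), select(), slice(), %>%
library(sp)     # point.in.polygon()

# Note: this reads the internals of ggplot_build(), which may differ between ggplot2 versions.
plot_outliers_only <- function (original_plot) {
  pb <- ggplot_build(original_plot)
  # First contour piece of each group (label contains "001"), treated as the outermost contour
  group_labels <- grep("001", levels(pb$data[[1]]$group), value=TRUE)
  outlier_points <- lapply(group_labels, function (gl) {
    contour_data <- filter(pb$data[[1]], as.character(group)==gl)
    group_id <- as.numeric(strsplit(gl, "-")[[1]][1])
    # Indices (within this group) of points strictly outside the contour
    outlier_id <- pb$data[[2]] %>%
      filter(group==group_id) %>%
      select(c(x, y)) %>%
      apply(1, function (point) {
        point.in.polygon(point[1], point[2], contour_data$x, contour_data$y)==0
      }) %>%
      which()
    # rbind() ignores NULL, so groups without outliers simply drop out
    if (length(outlier_id)==0) return (NULL)
    # Map the indices back to the rows of the original data for this group
    grouping_name <- as.character(original_plot$mapping$colour)
    as.numeric(original_plot$data[, grouping_name]) %>%
      `==`(group_id) %>%
      which() %>%
      slice(original_plot$data, .) %>%
      `[`(., outlier_id, )
  })
  do.call(what=rbind, outlier_points)
}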
P <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Width, color = Species)) +
  stat_density2d(geom = 'polygon', aes(fill = ..level..), n = 8) +
  geom_point()
plot_outliers_only(P)
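If reaching into the built plot feels too fragile, a rough alternative is to estimate the density directly and subset on it. The sketch below is an assumption-laden illustration rather than a reproduction of the contours in the plot: it uses MASS::kde2d() per species, looks up the density at the grid cell containing each point, and flags points below the 5% quantile of their own species. The helper point_density(), the grid size of 50 and the 5% cutoff are all arbitrary choices made for the example.

```r
library(MASS)  # kde2d()

# Hypothetical helper: approximate 2D kernel density at each observation,
# read off the grid cell of the kde2d() estimate that contains it
point_density <- function(x, y, n = 50) {
  k <- kde2d(x, y, n = n)
  ix <- findInterval(x, k$x, all.inside = TRUE)
  iy <- findInterval(y, k$y, all.inside = TRUE)
  k$z[cbind(ix, iy)]
}

# Estimate the density separately within each species
dens <- numeric(nrow(iris))
for (sp in levels(iris$Species)) {
  i <- iris$Species == sp
  dens[i] <- point_density(iris$Petal.Length[i], iris$Sepal.Width[i])
}

# Assumed cutoff: the lowest 5% of densities within each species
cutoff <- ave(dens, iris$Species, FUN = function(d) quantile(d, 0.05))
outliers <- iris[dens < cutoff, ]
```

The resulting outliers data frame can be handed to geom_point(data = outliers) on top of the density layer. Because the cutoff is a quantile rather than the lowest contour level, the flagged points will not necessarily match the ones outside the drawn polygons.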
Upvotes: 1