DaveM

Reputation: 674

Displaying stat_summary within each group, by aesthetic mapping, in ggplot

I am close to plotting what I wanted, but haven't quite figured out whether stat_summary is the right way to display the desired plot.

The desired output is the scatter plot with a median line for each year, within each category. For example, in the plot below, I would want a median line for the values in 1999, 2000, and 2001 in Category A (i.e., 3 lines by color) and then the same in Category B (so 6 median lines total).

I looked here, but this didn't seem to get at what I wanted since it was using facets.

My plot appears to draw a single line connecting the medians of the two categories. Can stat_summary draw a separate median line within each category, or do I need a different approach (such as calculating the medians and adding each line to the plot by category)?

enter image description here

Reproducible simple example

library(tidyverse)
library(lubridate)

# Sample data
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2002-01-01"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)

# Create data frame
mydata   <- data.frame(Date, Category, Value)

# Plot by category and color by year
p <- ggplot(mydata, aes(x = Category, y = Value,
                        color = factor(year(Date))
                        )
            ) + 
  geom_jitter() 
p


# Now add median values of each year for each group
p <- p +
  # fun replaces fun.y, which is deprecated as of ggplot2 3.3.0
  stat_summary(fun   = median,
               geom  = "line",
               aes(color = factor(year(Date))),
               group = 1,
               size  = 2)
p

Upvotes: 5

Views: 6990

Answers (2)

Maurits Evers

Reputation: 50728

Here is another possibility, using geom_errorbar instead of stat_summary:

library(tidyverse)
library(lubridate)

# Sample data
set.seed(2017)
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2002-01-01"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)
mydata   <- data.frame(Date, Category, Value)

mydata %>%
    mutate(colour = factor(year(Date))) %>%
    group_by(Category, colour) %>%
    # Attach the per-group median to every row of that group
    mutate(Median = median(Value)) %>%
    ungroup() %>%
    ggplot(aes(Category, Value, colour = colour)) +
    geom_jitter() +
    # ymin == ymax collapses each error bar to a horizontal line
    geom_errorbar(aes(ymin = Median, ymax = Median))


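Explanation: We pre-compute the median per Category and per year(Date), then draw the median lines with geom_errorbar. Because ymin and ymax are both set to the median, each error bar collapses to a single horizontal line.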
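To make the pre-computation step concrete, here is a minimal sketch on a tiny, made-up data frame (the name toy and its values are hypothetical and not taken from the data above). It shows that a grouped mutate gives every row the median of its own Category and Year group.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  Category = c("A", "A", "A", "B", "B", "B"),
  Year     = factor(c(1999, 1999, 2000, 1999, 2000, 2000)),
  Value    = c(10, 30, 50, 20, 40, 80)
)

# Grouped mutate: the row count is unchanged, and each row
# receives the median of its Category/Year group
toy_med <- toy %>%
  group_by(Category, Year) %>%
  mutate(Median = median(Value)) %>%
  ungroup()

toy_med
```

Because the median is repeated on every row, geom_errorbar draws one bar per row, and the bars within a group simply overlap.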


Update

In response to your comment: if you want to use summarise to pre-compute the median values, you can store them in a separate data.frame:

# One row per Category and Year, holding that group's median
df <- mydata %>%
    mutate(Year = as.factor(year(Date))) %>%
    group_by(Category, Year) %>%
    summarise(Median = median(Value))

ggplot(mydata, aes(Category, Value, colour = factor(year(Date)))) +
    geom_jitter() +
    # y = Median overrides the inherited y = Value, which does not exist in df
    geom_errorbar(
        data = df,
        aes(x = Category, y = Median, colour = Year, ymin = Median, ymax = Median))

It's not quite as clean as the first solution (since you need to specify all aesthetics in geom_errorbar) but the resulting plot is the same.
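A possible variant of this second approach, sketched here as an assumption rather than as part of the original answer, passes inherit.aes = FALSE so that the summary layer ignores the plot-level mappings and lists only the aesthetics it needs. The end date of 2001-12-31 and the name medians are choices made for this illustration.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Sample data (end date chosen so that only 1999-2001 occur)
set.seed(2017)
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2001-12-31"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)
mydata   <- data.frame(Date, Category, Value) %>%
    mutate(Year = factor(year(Date)))

# One row per Category/Year combination
medians <- mydata %>%
    group_by(Category, Year) %>%
    summarise(Median = median(Value)) %>%
    ungroup()

p <- ggplot(mydata, aes(Category, Value, colour = Year)) +
    geom_jitter() +
    # inherit.aes = FALSE: this layer uses only the mappings given here
    geom_errorbar(
        data = medians,
        aes(x = Category, ymin = Median, ymax = Median, colour = Year),
        inherit.aes = FALSE)
p
```

Creating the Year column once, up front, also lets both layers share the same colour variable and legend title.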

Upvotes: 3

camille

Reputation: 16862

What you're looking for is actually a point, even though it looks like a line: you don't want to connect observations (which is what a line does), you just want to show a discrete value (which is what a point does).

One way, very similar to the post you linked, is to do your stat_summary and use a shape that is essentially a large dash. I turned down the alpha and size of the jittered points to distinguish them from the medians better. For the medians, I kept the color assignment the same but set the group to the interaction between year and category, so there would be six distinct medians calculated.

Note that I set a seed for random number generation and changed the end date to 12/31/2001 instead of 1/1/2002, since you said you expected 3 years but during one generation I got a few observations of 1/1/2002.

library(tidyverse)
library(lubridate)

set.seed(987)
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2001-12-31"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)

# Create data frame
mydata   <- data.frame(Date, Category, Value)

mydata <- mydata %>%
  mutate(year = year(Date) %>% as.factor())

ggplot(mydata, aes(x = Category, y = Value, color = year)) +
  geom_jitter(size = 0.6, alpha = 0.6) +
  # fun replaces fun.y, which is deprecated as of ggplot2 3.3.0
  stat_summary(fun = median,
               geom = "point",
               aes(group = interaction(Category, year)),
               shape = 95, size = 12, show.legend = FALSE)

Created on 2018-07-01 by the reprex package (v0.2.0).
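As a quick check that the interaction grouping really yields one median per category and year, the computed layer can be inspected with layer_data(). This is a minimal sketch on a small made-up data frame (toy and its values are hypothetical), and it assumes ggplot2 3.3.0 or later for the fun argument.

```r
library(ggplot2)

# Hypothetical toy data: 2 categories x 3 years, 10 values each
set.seed(987)
toy <- data.frame(
  Category = rep(c("A", "B"), each = 30),
  year     = factor(rep(c(1999, 2000, 2001), times = 20)),
  Value    = sample(100:500, 60, replace = TRUE)
)

p <- ggplot(toy, aes(x = Category, y = Value, color = year)) +
  geom_jitter(size = 0.6, alpha = 0.6) +
  stat_summary(fun = median,
               geom = "point",
               aes(group = interaction(Category, year)),
               shape = 95, size = 12, show.legend = FALSE)

# Layer 2 is the stat_summary layer: one row per computed median
medians <- layer_data(p, 2)
nrow(medians)
```

The y column of medians holds the values that are drawn as dashes, so it can be compared against medians computed by hand.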

Upvotes: 5
