caker1012

Reputation: 55

Why is ggplot ignoring my factor levels when I subset my data?

I am using some code I got from an answer to a previous question, but I ran into a funny problem and I'd like some expert insight into what is going on. I am trying to plot monthly deviations from an annual mean using bar charts. Specifically, I am coloring the bars differently depending on whether the monthly mean is above or below the annual mean. I am using the txhousing dataset, which is included with the ggplot2 package.

I thought I could use a factor to denote whether or not this is the case. The months are correctly ordered when I only plot a subset of the data (the "higher" values), but when I add another plot, ggplot rearranges all of the months to be alphabetical. Does anyone know why this happens, and what a workaround would be?

Thank you so much for any input! Criticism of my code is welcome :)

Reproducible Examples

1. Using just one plot

library(tidyverse)

# subset txhousing to just 2014, and calculate monthly means and dates
housing_df <- filter(txhousing, year == 2014) %>%
  group_by(year, month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE),
            date = first(date)) %>%
  mutate(month = factor(month.abb[month], levels = month.abb, ordered = TRUE),
         salesdiff = monthly_mean - mean(monthly_mean), # monthly deviation
         higherlower = case_when(salesdiff >= 0 ~ "higher",
                                 salesdiff < 0 ~ "lower"))

ggplot(data = housing_df, aes(x = month, y = salesdiff, higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "higher"), aes(y = salesdiff, fill = higherlower)) +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none") # remove legend


2. Using two plots with all of the data:

ggplot(data = housing_df, aes(x = month, y = salesdiff, higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "higher"), aes(y = salesdiff, fill = higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "lower"), aes(y = salesdiff, fill = higherlower)) +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none") # remove legend


Upvotes: 4

Views: 2330

Answers (2)

Mike Lee

Reputation: 144

Additionally, adding + scale_x_discrete(drop = FALSE) overrides potentially different factor levels from different data sources in your ggplot.

This topic is also addressed here: https://github.com/tidyverse/ggplot2/issues/577
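
A minimal sketch of how this could look on the two-layer plot from the question. The data preparation is a shortened version of the asker's code (group_by(month) and if_else() are my simplifications, and p_drop is just an illustrative name), so treat it as an assumption rather than the exact original:

```r
library(dplyr)
library(ggplot2)

# Rebuild the 2014 summary table (txhousing ships with ggplot2).
# Simplified from the question: grouping by month only, if_else() instead of case_when().
housing_df <- txhousing %>%
  filter(year == 2014) %>%
  group_by(month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE)) %>%
  mutate(month = factor(month.abb[month], levels = month.abb),
         salesdiff = monthly_mean - mean(monthly_mean),  # monthly deviation
         higherlower = if_else(salesdiff >= 0, "higher", "lower"))

# Two layers, each with its own subset; drop = FALSE keeps every factor level on the x axis
p_drop <- ggplot(housing_df, aes(x = month, y = salesdiff, fill = higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "higher")) +
  geom_col(data = filter(housing_df, higherlower == "lower")) +
  scale_x_discrete(drop = FALSE) +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none")

p_drop
```

Because each subset still carries all twelve levels of the month factor, keeping unused levels should give both layers the same set of x values to start from.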

Upvotes: 5

Rohit Das

Reputation: 2032

There are multiple ways to do this, but I find it a bit of trial and error. You are already doing the most common fix, which is to convert month into a factor, and that's why the first plot works. Why it does not work in the second case is a bit of a mystery, but try adding + scale_x_discrete(limits = housing_df$month) to override the x axis order and see if that works.
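
Here is a rough sketch of that suggestion applied to the two-layer plot. Two things are my own assumptions: the data preparation is a shortened version of the question's code, and I pass levels(housing_df$month) (a plain character vector) as the limits instead of the factor itself. The name p_limits is only for illustration.

```r
library(dplyr)
library(ggplot2)

# Rebuild the 2014 summary table (shortened from the question's code)
housing_df <- txhousing %>%
  filter(year == 2014) %>%
  group_by(month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE)) %>%
  mutate(month = factor(month.abb[month], levels = month.abb),
         salesdiff = monthly_mean - mean(monthly_mean),  # monthly deviation
         higherlower = if_else(salesdiff >= 0, "higher", "lower"))

# Explicit limits fix the x axis order regardless of how many layers are added.
# Assumption: levels() is used here so the limits are a character vector.
p_limits <- ggplot(housing_df, aes(x = month, y = salesdiff, fill = higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "higher")) +
  geom_col(data = filter(housing_df, higherlower == "lower")) +
  scale_x_discrete(limits = levels(housing_df$month)) +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none")

p_limits
```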

I agree with the other comments that the best way would be to not use the extra layer at all, as it's not needed in this specific case, but the above solution works even when there are multiple layers.
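
For completeness, a sketch of the single-layer version, where fill is mapped once and no subsetting is needed. The data preparation is again shortened from the question, and p_single is just an illustrative name:

```r
library(dplyr)
library(ggplot2)

# Rebuild the 2014 summary table (shortened from the question's code)
housing_df <- txhousing %>%
  filter(year == 2014) %>%
  group_by(month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE)) %>%
  mutate(month = factor(month.abb[month], levels = month.abb),
         salesdiff = monthly_mean - mean(monthly_mean),  # monthly deviation
         higherlower = if_else(salesdiff >= 0, "higher", "lower"))

# One geom_col() on the full data: the fill aesthetic handles the colouring,
# and the x scale sees the month factor only once
p_single <- ggplot(housing_df, aes(x = month, y = salesdiff, fill = higherlower)) +
  geom_col() +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none")

p_single
```

With a single layer the x scale is trained on one data set only, so there is nothing for it to merge across layers.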

Upvotes: 7
