Arcticweasel

Reputation: 43

Plot percentages in R relative to different base totals

I want to visually compare two datasets on traffic stops in two US states in R with the ggplot2 package. I combined them into one data frame and displayed the total number of traffic stops per year and state. As these numbers are very different, I want to compare the percentage of stops in relation to the population of each state. First, here is a sample df and what I have achieved so far. I used the tidyverse and lubridate in my code.

df <- data.frame(ID=c("CA-2013-0000001","CA-2014-0000001", "TX-2013-0000001", "TX-2014-0000001"),
                    State=c("CA", "CA", "TX", "TX"),
                    Stop_Date=ymd("2013-01-01","2014-01-01", "2013-01-01", "2014-01-01"))

df %>%
  group_by(year = year(Stop_Date), state = State) %>%
  count() %>%
  ggplot(aes(year, n, col = state))+
  geom_point(stat = "identity")+
  geom_line(stat = "identity")

With that code I get a plot with two lines, one for each of the two states I am looking at.

I want to create the exact same plot, but instead of total numbers I want to display the percentage in relation to the state populations, which are population_ca <- 38620000 and population_tx <- 26980000.

I tried these two approaches, but each returns a different error every time I run the code:

df %>%
  group_by(year = year(Stop_Date), state = State) %>%
  summarise(PercentStopsToPopulation = if_else(state == "CA",
                                                 ((n()/population_ca)*100),
                                                 ((n()/population_tx)*100))) %>%
  ggplot(aes(year, PercentStopsToPopulation, col = state))+
  geom_point(stat = "identity")+
  geom_line(stat = "identity")

df %>%
  group_by(year = year(Stop_Date), state = State) %>%
  summarise(PercentCA = ifelse(state == "CA",((n()/population_ca)*100)),
            PercentTX = ifelse(state == "TX", ((n()/population_tx)*100))) %>%
  ggplot(aes(year, PercentCA))+
  geom_point(stat = "identity")+
  geom_line(stat = "identity")+
  geom_point(aes(year, PercentTX), stat = "identity")+
  geom_line(aes(year, PercentTX), stat = "identity")

I really hope someone can help me with this and tell me where my mistakes are. Thank you in advance!

Upvotes: 0

Views: 68

Answers (1)

Rebecca Bennett

Reputation: 131

Here's how I would approach this problem. I use the tidyverse, so you'll notice some changes.

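library("tidyverse")
#ymd() and year() come from lubridate, which older tidyverse versions do not attach automatically
library("lubridate")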

#keep organized and avoid for loops by organizing population data in a tibble
pop <- tibble(state = c("CA", "TX"),
              population = c(38620000, 26980000))

#I made this a tibble instead of a dataframe, just to stay consistent in the tidyverse approach.
df <- tibble(ID=c("CA-2013-0000001","CA-2014-0000001", "TX-2013-0000001", "TX-2014-0000001"),
                 state=c("CA", "CA", "TX", "TX"),
                 stop_date=ymd("2013-01-01","2014-01-01", "2013-01-01", "2014-01-01")) %>%
  #I prefer to err on the side of making more fields, to make it easier to see what we're doing down the road.
  mutate(year = year(stop_date))

# summarise data
df_count <- df %>%
  group_by(year, state) %>%
  count() %>%
  #Join with population table. I prefer this over a for loop - easier to scale up, in case you decide to add more states.
  full_join(pop, by = "state") %>%
  # Calculate the percent of population
  mutate(percent = 100*n/population)


#Now, we graph!
df_count %>%
  ggplot(aes(year, percent, col = state))+
  geom_point()+
  geom_line()
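
As for where the two original attempts go wrong, here is my best guess, since the exact error messages were not posted. In the first attempt, state inside summarise() is a vector with one element per row of the group, so if_else() returns one value per row instead of a single summary value; depending on the dplyr version, that either errors or produces more rows than expected. In the second attempt, ifelse() is called without its third argument (no), which it needs whenever the test is FALSE. A minimal sketch of a fix that stays close to the first attempt, assuming the two population_* variables from the question and dplyr 1.0.0 or later (for .groups), is to count first and convert to a percentage afterwards. The names stops_pct and p are mine, purely for illustration:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Populations as quoted in the question
population_ca <- 38620000
population_tx <- 26980000

# Sample data from the question
df <- data.frame(
  ID = c("CA-2013-0000001", "CA-2014-0000001", "TX-2013-0000001", "TX-2014-0000001"),
  State = c("CA", "CA", "TX", "TX"),
  Stop_Date = ymd(c("2013-01-01", "2014-01-01", "2013-01-01", "2014-01-01"))
)

stops_pct <- df %>%
  group_by(year = year(Stop_Date), state = State) %>%
  # One row per year/state, holding the number of stops in that group
  summarise(n = n(), .groups = "drop") %>%
  # Now state has exactly one value per row, so if_else() works row by row
  mutate(percent = if_else(state == "CA",
                           100 * n / population_ca,
                           100 * n / population_tx))

# Same plot as before, with the percentage on the y axis
p <- ggplot(stops_pct, aes(year, percent, col = state)) +
  geom_point() +
  geom_line()
```

This works for two states, but every additional state needs another nested if_else(), which is why I would still go with the join against a population table.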

Please let me know if you have any questions. :)

Upvotes: 1
