Reputation: 11657
Let's say I have a data.frame consisting of industry type and starting and ending dates (e.g. for an employee).
mydf <- data.frame(industry = c("Government", "Education", "Military", "Private Sector", "Government", "Private Sector"),
start_date = c("2014-01-01", "2016-02-01", "2012-11-01", "2013-03-01", "2012-12-01", "2011-12-01"),
end_date = c("2020-12-01", "2016-10-01", "2014-01-01", "2016-10-01", "2015-10-01", "2014-09-01"))
> mydf
industry start_date end_date
1 Government 2014-01-01 2020-12-01
2 Education 2016-02-01 2016-10-01
3 Military 2012-11-01 2014-01-01
4 Private Sector 2013-03-01 2016-10-01
5 Government 2012-12-01 2015-10-01
6 Private Sector 2011-12-01 2014-09-01
I'd like to create a stacked ggplot bar chart in which each unique year in the start_date
column is on the X axis (e.g. 2011-2016) and the y axis represents the total number of observations (the row count) represented in a given industry for that year.
I'm not sure what the right way to manipulate the data.frame to allow for this. Presumably I'd need to manipulate the data to have columns for industry
year
and count
. But I'm not sure how to produce the year columns from a date range. Any ideas?
Upvotes: 3
Views: 799
Reputation: 886938
Convert the date columns to Date
, create the 'date' seq
uence from the 'start_date' to 'end_date' for each row with map2
(from purrr
), unnest
the list
output, count
the year
and plot with geom_bar
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
mydf %>%
mutate(across(c(start_date, end_date), as.Date)) %>%
transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>%
unnest(c(date)) %>%
count(industry, year = factor(year(date))) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_col() +
theme_bw()
If the plot should be separate for each 'industry'
mydf %>%
mutate(across(c(start_date, end_date), as.Date)) %>%
transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>%
unnest(c(date)) %>%
count(industry, year = factor(year(date))) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_col() +
facet_wrap(~ industry) +
theme_bw()
-output
As @IanCampbell suggested, the by
for seq
can be 'year'
mydf %>%
mutate(across(c(start_date, end_date), as.Date)) %>%
transmute(industry, date = map2(start_date, end_date, seq, by = 'year')) %>%
unnest(c(date)) %>%
count(industry, year = factor(year(date))) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_col() +
facet_wrap(~ industry) +
theme_bw()
Upvotes: 3
Reputation: 24770
Is this what you're looking for?
I would recommend using purrr::pmap
to create a new data frame with one row for each year based on each row of the original data.
We can use the purrr::pmap_dfr
to automatically return a single data frame bound by row.
We can use the ~with(list(...), )
trick to be able to reference columns by name.
Then we can use dplyr::count
to count by combinations of columns. Then it's easy.
library(dplyr)
library(purrr)
library(lubridate)
library(ggplot)
mydf %>%
mutate(across(c(start_date, end_date), as.Date),
start_year = year(start_date),
end_year = year(end_date)) %>%
pmap_dfr(~with(list(...),data.frame(industry,
year = seq(start_year, end_year)))) %>%
count(year, industry) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_bar(stat="identity")
Upvotes: 2