Reputation: 11697
Let's say I have a data.frame that looks like this:
user_df = read.table(text = "person_id job_number job_type start_date end_date
1 1 B 2012-11-01 2014-01-01
1 2 A 2016-02-01 2016-10-01
1 3 A 2016-12-01 2020-01-01
1 4 B 2020-01-01 2021-01-01
2 1 A 2011-03-01 2012-08-01
2 2 B 2013-01-01 2020-01-01
2 3 A 2020-01-01 2021-01-01
2 4 B 2021-01-01 2021-01-17
3 1 A 2005-03-01 2011-03-01
3 2 B 2012-01-01 2014-01-01", header = T)
Each person_id
has a start and an ending date for a given job. I would like to insert rows in between the empty space between jobs, and create an additional column called unemployed
that is set to 1 for those columns.
The resulting data.frame for the first several rows would look like this:
user_df = read.table(text = "person_id job_number job_type start_date end_date unemployed
1 1 B 2012-11-01 2014-01-01 0
1 1 B 2014-01-01 2016-02-01 1
1 2 A 2016-02-01 2016-10-01 0
1 2 A 2016-10-01 2016-12-01 1
1 3 A 2016-12-01 2020-01-01 0
1 4 B 2020-01-01 2021-01-01 0
2 1 A 2011-03-01 2012-08-01 0
2 1 A 2012-08-01 2013-01-01 1
2 2 B 2013-01-01 2020-01-01 0", header = T)
So I'm essentially inserting a new row with the previous rows' end date as its start date and the next row's start date as its end date.
Not sure where to even start with this. I was able to compute the total amount of time spent unemployed by simply summing up the days spanning the earliest start date and the last ending date and subtracting that from the total time actually accumulated by each row. But I'm not sure how I'd go about programatically inserting rows within a dplyr chain to fill in the unemployed time.
Upvotes: 0
Views: 870
Reputation: 160817
library(dplyr)
user_df %>%
arrange(start_date) %>%
group_by(person_id) %>%
mutate(nextstart = lead(start_date)) %>%
filter(end_date < nextstart) %>%
mutate(start_date = end_date, end_date = nextstart, unemployed = 1L) %>%
select(-nextstart) %>%
bind_rows(mutate(user_df, unemployed = 0L)) %>%
arrange(person_id, start_date) %>%
ungroup()
# # A tibble: 14 x 6
# person_id job_number job_type start_date end_date unemployed
# <int> <int> <chr> <chr> <chr> <int>
# 1 1 1 B 2012-11-01 2014-01-01 0
# 2 1 1 B 2014-01-01 2016-02-01 1
# 3 1 2 A 2016-02-01 2016-10-01 0
# 4 1 2 A 2016-10-01 2016-12-01 1
# 5 1 3 A 2016-12-01 2020-01-01 0
# 6 1 4 B 2020-01-01 2021-01-01 0
# 7 2 1 A 2011-03-01 2012-08-01 0
# 8 2 1 A 2012-08-01 2013-01-01 1
# 9 2 2 B 2013-01-01 2020-01-01 0
# 10 2 3 A 2020-01-01 2021-01-01 0
# 11 2 4 B 2021-01-01 2021-01-17 0
# 12 3 1 A 2005-03-01 2011-03-01 0
# 13 3 1 A 2011-03-01 2012-01-01 1
# 14 3 2 B 2012-01-01 2014-01-01 0
Technically, this is comparing by the alphabetic sort of dates; in this case, its effect is the same (the format is good for that) though it'll be slightly less efficient (integer/numeric sorting is faster than alphabetic sorting).
This works by first creating and then capture just the unemployed periods of time,
user_df %>%
arrange(start_date) %>%
group_by(person_id) %>%
mutate(nextstart = lead(start_date)) %>%
filter(end_date < nextstart)
# # A tibble: 4 x 6
# # Groups: person_id [3]
# person_id job_number job_type start_date end_date nextstart
# <int> <int> <chr> <chr> <chr> <chr>
# 1 3 1 A 2005-03-01 2011-03-01 2012-01-01
# 2 2 1 A 2011-03-01 2012-08-01 2013-01-01
# 3 1 1 B 2012-11-01 2014-01-01 2016-02-01
# 4 1 2 A 2016-02-01 2016-10-01 2016-12-01
then shifting the variables, then adding unemployed
, and then finally returning it to the original dataset. In this case, I added unemployed
to the original mid-bind_rows
; where to do this is mostly preference.
Upvotes: 3