Parseltongue
Parseltongue

Reputation: 11697

Fill all gaps between starting and ending dates with dplyr

Let's say I have a data.frame that looks like this:

user_df = read.table(text = "person_id job_number job_type start_date end_date
                  1 1 B 2012-11-01 2014-01-01
                  1 2 A 2016-02-01 2016-10-01
                  1 3 A 2016-12-01 2020-01-01
                  1 4 B 2020-01-01 2021-01-01
                  2 1 A 2011-03-01 2012-08-01
                  2 2 B 2013-01-01 2020-01-01
                  2 3 A 2020-01-01 2021-01-01
                  2 4 B 2021-01-01 2021-01-17
                  3 1 A 2005-03-01 2011-03-01
                  3 2 B 2012-01-01 2014-01-01", header = T)

Each person_id has a start and an ending date for a given job. I would like to insert rows in between the empty space between jobs, and create an additional column called unemployed that is set to 1 for those columns.

The resulting data.frame for the first several rows would look like this:

user_df = read.table(text = "person_id job_number job_type start_date end_date unemployed
                  1 1 B 2012-11-01 2014-01-01 0
                  1 1 B 2014-01-01 2016-02-01 1
                  1 2 A 2016-02-01 2016-10-01 0
                  1 2 A 2016-10-01 2016-12-01 1
                  1 3 A 2016-12-01 2020-01-01 0
                  1 4 B 2020-01-01 2021-01-01 0
                  2 1 A 2011-03-01 2012-08-01 0
                  2 1 A 2012-08-01 2013-01-01 1
                  2 2 B 2013-01-01 2020-01-01 0", header = T)

So I'm essentially inserting a new row with the previous rows' end date as its start date and the next row's start date as its end date.

Not sure where to even start with this. I was able to compute the total amount of time spent unemployed by simply summing up the days spanning the earliest start date and the last ending date and subtracting that from the total time actually accumulated by each row. But I'm not sure how I'd go about programatically inserting rows within a dplyr chain to fill in the unemployed time.

Upvotes: 0

Views: 870

Answers (1)

r2evans
r2evans

Reputation: 160817

library(dplyr)
user_df %>%
  arrange(start_date) %>%
  group_by(person_id) %>%
  mutate(nextstart = lead(start_date)) %>%
  filter(end_date < nextstart) %>%
  mutate(start_date = end_date, end_date = nextstart, unemployed = 1L) %>%
  select(-nextstart) %>%
  bind_rows(mutate(user_df, unemployed = 0L)) %>%
  arrange(person_id, start_date) %>%
  ungroup()
# # A tibble: 14 x 6
#    person_id job_number job_type start_date end_date   unemployed
#        <int>      <int> <chr>    <chr>      <chr>           <int>
#  1         1          1 B        2012-11-01 2014-01-01          0
#  2         1          1 B        2014-01-01 2016-02-01          1
#  3         1          2 A        2016-02-01 2016-10-01          0
#  4         1          2 A        2016-10-01 2016-12-01          1
#  5         1          3 A        2016-12-01 2020-01-01          0
#  6         1          4 B        2020-01-01 2021-01-01          0
#  7         2          1 A        2011-03-01 2012-08-01          0
#  8         2          1 A        2012-08-01 2013-01-01          1
#  9         2          2 B        2013-01-01 2020-01-01          0
# 10         2          3 A        2020-01-01 2021-01-01          0
# 11         2          4 B        2021-01-01 2021-01-17          0
# 12         3          1 A        2005-03-01 2011-03-01          0
# 13         3          1 A        2011-03-01 2012-01-01          1
# 14         3          2 B        2012-01-01 2014-01-01          0

Technically, this is comparing by the alphabetic sort of dates; in this case, its effect is the same (the format is good for that) though it'll be slightly less efficient (integer/numeric sorting is faster than alphabetic sorting).

This works by first creating and then capture just the unemployed periods of time,

user_df %>%
  arrange(start_date) %>%
  group_by(person_id) %>%
  mutate(nextstart = lead(start_date)) %>%
  filter(end_date < nextstart)
# # A tibble: 4 x 6
# # Groups:   person_id [3]
#   person_id job_number job_type start_date end_date   nextstart 
#       <int>      <int> <chr>    <chr>      <chr>      <chr>     
# 1         3          1 A        2005-03-01 2011-03-01 2012-01-01
# 2         2          1 A        2011-03-01 2012-08-01 2013-01-01
# 3         1          1 B        2012-11-01 2014-01-01 2016-02-01
# 4         1          2 A        2016-02-01 2016-10-01 2016-12-01

then shifting the variables, then adding unemployed, and then finally returning it to the original dataset. In this case, I added unemployed to the original mid-bind_rows; where to do this is mostly preference.

Upvotes: 3

Related Questions