analyzing set of many dummy columns based on date ranges

Question

I am analyzing staff turnover and need to create (1) a count of the number of staff hired and exited in a given year and (2) compute a cumulative "total staff" count across years. I have hire and exit dates like this:

ssh<-structure(list(HireDate = structure(c(1358, 4291, 5121, 6923, 9678, 12037, 16353, 17003, 18976, 19312, 19312, 19011), class = "Date"),     ExitDate = structure(c(15861, 15401, 17140, 17347, NA, NA,     16911, 18856, 19193, NA, NA, NA), class = "Date"), id = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))

ssh$hireyear<-lubridate::year(ssh$HireDate)
ssh$exityear<-lubridate::year(ssh$ExitDate)

ssh$group<-c("a","b","c","a","","b","a","","","b","b","c")

For the simple accounting of hires and exits, I'd like to have a dummy variable for EACH year. So for the data above, if staff was hired in 2014, create a new column hired2014 equal to 1, else 0, like this:

 ssh$hire1984<-ifelse(ssh$hireyear==1984,1,0)

or

 ssh$exit2012<-ifelse(ssh$exityear==2012,1,0)

My full dataset ranges between 1972 and 2023, so I'd like an efficient method to compute all possible variables for any date range. This would yield a dataframe with many columns -- one for each year.

Next, I'd like to format the resulting dataframe by year, something like this:

Year  NumberHired  NumberExited  NetChange  CurrentTotal
1972  4            0             4          4
1973  2            1             1          5
1974  3            4             -1         4
.
.
2010  25           11            14         541   ...etc

I experimented creating a lookup table of years that I might populate with an aggregation of the dummy variables but am hitting a wall. Also, other solutions I've found on stackoverflow tend to focus on creating only one dummy variable.

Ideas? Thx!

Gregor Thomas · Accepted Answer

We can use tidyr::pivot_wider to make your dummies:

library(tidyr)
library(dplyr)

ssh |>
  mutate(dummy = 1) |>
  pivot_wider(
    names_from = hireyear, 
    values_from = dummy,
    names_prefix = "hired",
    values_fill = 0
  ) |>
  mutate(dummy = 1) |> 
  pivot_wider(
    names_from = exityear, 
    values_from = dummy,
    names_prefix = "exit",
    values_fill = 0
  )
# # A tibble: 12 × 20
#    HireDate   ExitDate   id    hired1973 hired1981 hired1984 hired1988 hired1996 hired2002
#                                            
#  1 1973-09-20 2013-06-05 1             1         0         0         0         0         0
#  2 1981-10-01 2012-03-02 2             0         1         0         0         0         0
#  3 1984-01-09 2016-12-05 3             0         0         1         0         0         0
#  4 1988-12-15 2017-06-30 4             0         0         0         1         0         0
#  5 1996-07-01 NA         5             0         0         0         0         1         0
#  6 2002-12-16 NA         6             0         0         0         0         0         1
#  7 2014-10-10 2016-04-20 7             0         0         0         0         0         0
#  8 2016-07-21 2021-08-17 8             0         0         0         0         0         0
#  9 2021-12-15 2022-07-20 9             0         0         0         0         0         0
# 10 2022-11-16 NA         10            0         0         0         0         0         0
# 11 2022-11-16 NA         11            0         0         0         0         0         0
# 12 2022-01-19 NA         12            0         0         0         0         0         0
# # ℹ 11 more variables: hired2014 , hired2016 , hired2021 , hired2022 ,
# #   exit2013 , exit2012 , exit2016 , exit2017 , exitNA ,
# #   exit2021 , exit2022

And the summary table is fairly easily computed from your original data:

hires = count(ssh, hireyear, name = "NumberHired") |>
  rename(year = hireyear)
exits = count(ssh, exityear, name = "NumberExited", ) |>
  rename(year = exityear) |>
  filter(!is.na(year))
full_join(hires, exits, by = "year") %>%
  replace(is.na(.), 0) |>
  arrange(year) |>
  mutate(
    NetChange = NumberHired - NumberExited,
    CurrentTotal = cumsum(NetChange)
  )
# # A tibble: 13 × 5
#     year NumberHired NumberExited NetChange CurrentTotal
#                                
#  1  1973           1            0         1            1
#  2  1981           1            0         1            2
#  3  1984           1            0         1            3
#  4  1988           1            0         1            4
#  5  1996           1            0         1            5
#  6  2002           1            0         1            6
#  7  2012           0            1        -1            5
#  8  2013           0            1        -1            4
#  9  2014           1            0         1            5
# 10  2016           1            2        -1            4
# 11  2017           0            1        -1            3
# 12  2021           1            1         0            3
# 13  2022           3            1         2            5

analyzing set of many dummy columns based on date ranges

Answers (1)

Related Questions