Jash Shah
Jash Shah

Reputation: 2164

count number of observations between two overlapping dates r

Let's say we have a data frame defined as follows:

mydata <- data.frame(id = c('A', 'B', 'C', 'D'),
                     start_date = as.Date(c('2012-08-05',
                                            '2013-05-04',
                                            '2012-02-01',
                                            '2015-03-02')),
                     end_date = as.Date(c('2014-01-12',
                                          '2015-06-05',
                                          '2016-05-06',
                                          '2017-09-12')))

Where start_date talks about the day an employee joined and end_date talks about the day he left and id is the unique employee id.

For each month from 5 August 2012 (the earliest start_date) to 12 September 2017 (the latest end_date) I would like the employee count month wise. The final output should be in a format similar to the one below: (doesn't matter if its in wide format or long format) Final Output Table

In the table above the columns denote the months (1 to 12), the rows the year and the cells in the table the number of employees in that month.

Any help will be highly appreciated.

Upvotes: 4

Views: 489

Answers (4)

ira
ira

Reputation: 2644

You can use also a data.table package or combo of data.table and dplyr.

I will show the dplyr and data.table version (The only reason I use dplyr is the %>% operator. You could also do the whole thing in one line without using %>%).

# load data.table
library(data.table)
# load dplyr
library(dplyr)

#for each employee id, list first days of months during which employee was working, then transform from long to wide format using dcast function
dt <- setDT(mydata)[, list(date = seq(as.Date(format(min(start_date), "%Y-%m-01")),
 as.Date(format(max(end_date), "%Y-%m-01")),
 by = "month")), by = id] %>% dcast(year(date) ~ month(date))

What exactly the code above does?

the by = id tells you that the operation (the calculation of dates in this case) in data.table will be performed for each employee id.

format(max(end_date), "%Y-%m-01") gives you the start of the last month during which each of the employee worked.

format(min(start_date), "%Y-%m-01)" start of the month in which the employee started working.

seq(..., ...., by = "month") gives you first day of all the months in which each of the employees worked.

%>% is the piping operator, it means "then". It is the same as using the result of setDT(mydata)[..., ..., ...] as the first parameter of dcast function.

dcast function transforms the long format (in this case, the result of setDT[...]) to wide format.

End that is the end :)

Upvotes: 0

Sven Hohenstein
Sven Hohenstein

Reputation: 81683

Here is a solution with mapply in base R.

# Function to get date of first day of a month (by @digEmAll)
toFirstDayOfMonth <- function(dates) dates - as.POSIXlt(dates)$mday + 1

# Generate all dates
dates <- Reduce(c, with(mydata, mapply(seq, toFirstDayOfMonth(start_date), end_date,
                                       by = "month")))

# Count occurrences of year/month combinations
table(format(dates, "%Y"), format(dates, "%m"))

The result:

       01 02 03 04 05 06 07 08 09 10 11 12
  2012  0  1  1  1  1  1  1  2  2  2  2  2
  2013  2  2  2  2  3  3  3  3  3  3  3  3
  2014  3  2  2  2  2  2  2  2  2  2  2  2
  2015  2  2  3  3  3  3  2  2  2  2  2  2
  2016  2  2  2  2  2  1  1  1  1  1  1  1
  2017  1  1  1  1  1  1  1  1  1  0  0  0

Upvotes: 8

www
www

Reputation: 39154

A solution from tidyverse and lubridate.

library(tidyverse)
library(lubridate)

mydata2 <- mydata %>%
  mutate(Dates = map2(start_date, end_date, ~seq(.x, .y, by = "day"))) %>%
  unnest() %>%
  mutate(Year = year(Dates), Month = month(Dates)) %>%
  group_by(Year, Month) %>%
  summarise(Employee = n_distinct(id)) %>%
  spread(Month, Employee, fill = 0)
mydata2
# A tibble: 6 x 13
# Groups:   Year [6]
   Year   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`  `12`
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  2012     0     1     1     1     1     1     1     2     2     2     2     2
2  2013     2     2     2     2     3     3     3     3     3     3     3     3
3  2014     3     2     2     2     2     2     2     2     2     2     2     2
4  2015     2     2     3     3     3     3     2     2     2     2     2     2
5  2016     2     2     2     2     2     1     1     1     1     1     1     1
6  2017     1     1     1     1     1     1     1     1     1     0     0     0

Upvotes: 3

count
count

Reputation: 1338

You can try:

table(unlist(lapply(1:nrow(mydata), function(x) {
    format(seq(from=mydata[x,2],to=mydata[x,3],by="month"),"%Y-%m")
    })))

Upvotes: 0

Related Questions