Reputation: 2164
Let's say we have a data frame defined as follows:
mydata <- data.frame(id = c('A', 'B', 'C', 'D'),
start_date = as.Date(c('2012-08-05',
'2013-05-04',
'2012-02-01',
'2015-03-02')),
end_date = as.Date(c('2014-01-12',
'2015-06-05',
'2016-05-06',
'2017-09-12')))
Where start_date
talks about the day an employee joined and end_date
talks about the day he left and id
is the unique employee id.
For each month from 5 August 2012 (the earliest start_date
) to 12 September 2017 (the latest end_date
) I would like the employee count month wise. The final output should be in a format similar to the one below: (doesn't matter if its in wide format or long format)
In the table above the columns denote the months (1 to 12), the rows the year and the cells in the table the number of employees in that month.
Any help will be highly appreciated.
Upvotes: 4
Views: 489
Reputation: 2644
You can use also a data.table
package or combo of data.table
and dplyr
.
I will show the dplyr
and data.table
version (The only reason I use dplyr is the %>%
operator. You could also do the whole thing in one line without using %>%
).
# load data.table
library(data.table)
# load dplyr
library(dplyr)
#for each employee id, list first days of months during which employee was working, then transform from long to wide format using dcast function
dt <- setDT(mydata)[, list(date = seq(as.Date(format(min(start_date), "%Y-%m-01")),
as.Date(format(max(end_date), "%Y-%m-01")),
by = "month")), by = id] %>% dcast(year(date) ~ month(date))
What exactly the code above does?
the by = id
tells you that the operation (the calculation of dates in this case) in data.table
will be performed for each employee id.
format(max(end_date), "%Y-%m-01")
gives you the start of the last month during which each of the employee worked.
format(min(start_date), "%Y-%m-01)"
start of the month in which the employee started working.
seq(..., ...., by = "month")
gives you first day of all the months in which each of the employees worked.
%>%
is the piping operator, it means "then". It is the same as using the result of setDT(mydata)[..., ..., ...] as the first parameter of dcast
function.
dcast
function transforms the long
format (in this case, the result of setDT[...]
) to wide
format.
End that is the end :)
Upvotes: 0
Reputation: 81683
Here is a solution with mapply
in base R.
# Function to get date of first day of a month (by @digEmAll)
toFirstDayOfMonth <- function(dates) dates - as.POSIXlt(dates)$mday + 1
# Generate all dates
dates <- Reduce(c, with(mydata, mapply(seq, toFirstDayOfMonth(start_date), end_date,
by = "month")))
# Count occurrences of year/month combinations
table(format(dates, "%Y"), format(dates, "%m"))
The result:
01 02 03 04 05 06 07 08 09 10 11 12
2012 0 1 1 1 1 1 1 2 2 2 2 2
2013 2 2 2 2 3 3 3 3 3 3 3 3
2014 3 2 2 2 2 2 2 2 2 2 2 2
2015 2 2 3 3 3 3 2 2 2 2 2 2
2016 2 2 2 2 2 1 1 1 1 1 1 1
2017 1 1 1 1 1 1 1 1 1 0 0 0
Upvotes: 8
Reputation: 39154
A solution from tidyverse
and lubridate
.
library(tidyverse)
library(lubridate)
mydata2 <- mydata %>%
mutate(Dates = map2(start_date, end_date, ~seq(.x, .y, by = "day"))) %>%
unnest() %>%
mutate(Year = year(Dates), Month = month(Dates)) %>%
group_by(Year, Month) %>%
summarise(Employee = n_distinct(id)) %>%
spread(Month, Employee, fill = 0)
mydata2
# A tibble: 6 x 13
# Groups: Year [6]
Year `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12`
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2012 0 1 1 1 1 1 1 2 2 2 2 2
2 2013 2 2 2 2 3 3 3 3 3 3 3 3
3 2014 3 2 2 2 2 2 2 2 2 2 2 2
4 2015 2 2 3 3 3 3 2 2 2 2 2 2
5 2016 2 2 2 2 2 1 1 1 1 1 1 1
6 2017 1 1 1 1 1 1 1 1 1 0 0 0
Upvotes: 3
Reputation: 1338
You can try:
table(unlist(lapply(1:nrow(mydata), function(x) {
format(seq(from=mydata[x,2],to=mydata[x,3],by="month"),"%Y-%m")
})))
Upvotes: 0