Reputation: 13
Case:
Given are a dataframe df
with (among others) a startDate
and an endDate
column. My goal is to "transform" df
to a new dataframe df_res
that contains one row for every month occuring in the interval (startDate, endDate)
, with additional columns year
, month
and numberOfDaysInMonth
, all of type int.
Example:
Input: df
id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2
... ... ... ...
Desired output: df_res
id year month numberOfDaysInMonth someOtherCol
1 2017 9 8 val1
1 2017 10 31 val1
1 2017 11 30 val1
1 2017 12 31 val1
1 2018 1 31 val1
1 2018 2 1 val1
2 2018 1 31 val2
2 2018 2 28 val2
2 2018 3 31 val2
... ... ... ... ...
Background:
I am relatively new to R but aware of the great dplyr
and lubridate
packages. I just did not manage to achieve the above in a neat way, even when using those packages. The closest I got was this: Expand rows by date range using start and end date, but that does not yield the number of days per month contained in the range.
Any help is greatly appreciated.
Upvotes: 1
Views: 1343
Reputation: 16121
A tidyverse
solution:
# example data
df = read.table(text = "
id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2
", header=T, stringsAsFactors=F)
library(tidyverse)
library(lubridate)
df %>%
mutate_at(vars(startDate, endDate), ymd) %>% # update to date columns (if needed)
group_by(id) %>% # for each id
mutate(d = list(seq(startDate, endDate, by="1 day"))) %>% # create a sequence of dates (as a list)
unnest() %>% # unnest data
group_by(id, year=year(d), month=month(d), someOtherCol) %>% # group by those variables (while getting year and month of each date in the sequence)
summarise(numberOfDaysInMonth = n()) %>% # count days
ungroup() # forget the grouping
# # A tibble: 9 x 5
# id year month someOtherCol numberOfDaysInMonth
# <int> <dbl> <dbl> <chr> <int>
# 1 1 2017 9 val1 8
# 2 1 2017 10 val1 31
# 3 1 2017 11 val1 30
# 4 1 2017 12 val1 31
# 5 1 2018 1 val1 31
# 6 1 2018 2 val1 1
# 7 2 2018 1 val2 31
# 8 2 2018 2 val2 28
# 9 2 2018 3 val2 31
Upvotes: 3
Reputation: 25225
If you do not mind a data.table
solution, you can create a sequence of consecutive dates between startDate and endDate before you aggregate by id, someOtherCol, year and month as follows:
dat[, .(Dates=seq(startDate, endDate, by="1 day")), by=.(id, someOtherCol)][,
.N, by=.(id, someOtherCol, year(Dates), month(Dates))]
output:
id someOtherCol year month N
1: 1 val1 2017 9 8
2: 1 val1 2017 10 31
3: 1 val1 2017 11 30
4: 1 val1 2017 12 31
5: 1 val1 2018 1 31
6: 1 val1 2018 2 1
7: 2 val2 2018 1 31
8: 2 val2 2018 2 28
9: 2 val2 2018 3 31
data:
library(data.table)
dat <- fread("id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2")
datecols <- c("startDate", "endDate")
dat[, (datecols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=datecols]
Upvotes: 2