KvA
KvA

Reputation: 13

Given a date range how to expand to the number of days per month in that range?

Case:
Given are a dataframe df with (among others) a startDate and an endDate column. My goal is to "transform" df to a new dataframe df_res that contains one row for every month occuring in the interval (startDate, endDate), with additional columns year, month and numberOfDaysInMonth, all of type int.

Example:
Input: df

  id    startDate     endDate  someOtherCol
   1   2017-09-23  2018-02-01          val1
   2   2018-01-01  2018-03-31          val2
 ...          ...         ...           ...

Desired output: df_res

  id  year  month  numberOfDaysInMonth  someOtherCol
   1  2017      9                    8          val1
   1  2017     10                   31          val1
   1  2017     11                   30          val1
   1  2017     12                   31          val1
   1  2018      1                   31          val1
   1  2018      2                    1          val1
   2  2018      1                   31          val2
   2  2018      2                   28          val2
   2  2018      3                   31          val2
 ...   ...    ...                  ...           ... 

Background:
I am relatively new to R but aware of the great dplyrand lubridate packages. I just did not manage to achieve the above in a neat way, even when using those packages. The closest I got was this: Expand rows by date range using start and end date, but that does not yield the number of days per month contained in the range.

Any help is greatly appreciated.

Upvotes: 1

Views: 1343

Answers (2)

AntoniosK
AntoniosK

Reputation: 16121

A tidyverse solution:

# example data
df = read.table(text = "
id    startDate     endDate  someOtherCol
1   2017-09-23  2018-02-01          val1
2   2018-01-01  2018-03-31          val2
", header=T, stringsAsFactors=F)

library(tidyverse)
library(lubridate)


df %>%
  mutate_at(vars(startDate, endDate), ymd) %>%                  # update to date columns (if needed)
  group_by(id) %>%                                              # for each id
  mutate(d = list(seq(startDate, endDate, by="1 day"))) %>%     # create a sequence of dates (as a list)
  unnest() %>%                                                  # unnest data
  group_by(id, year=year(d), month=month(d), someOtherCol) %>%  # group by those variables (while getting year and month of each date in the sequence)
  summarise(numberOfDaysInMonth = n()) %>%                      # count days
  ungroup()                                                     # forget the grouping

# # A tibble: 9 x 5
#      id  year month someOtherCol numberOfDaysInMonth
#   <int> <dbl> <dbl> <chr>                      <int>
# 1     1  2017     9 val1                           8
# 2     1  2017    10 val1                          31
# 3     1  2017    11 val1                          30
# 4     1  2017    12 val1                          31
# 5     1  2018     1 val1                          31
# 6     1  2018     2 val1                           1
# 7     2  2018     1 val2                          31
# 8     2  2018     2 val2                          28
# 9     2  2018     3 val2                          31

Upvotes: 3

chinsoon12
chinsoon12

Reputation: 25225

If you do not mind a data.table solution, you can create a sequence of consecutive dates between startDate and endDate before you aggregate by id, someOtherCol, year and month as follows:

dat[, .(Dates=seq(startDate, endDate, by="1 day")), by=.(id, someOtherCol)][,
    .N, by=.(id, someOtherCol, year(Dates), month(Dates))]

output:

   id someOtherCol year month  N
1:  1         val1 2017     9  8
2:  1         val1 2017    10 31
3:  1         val1 2017    11 30
4:  1         val1 2017    12 31
5:  1         val1 2018     1 31
6:  1         val1 2018     2  1
7:  2         val2 2018     1 31
8:  2         val2 2018     2 28
9:  2         val2 2018     3 31

data:

library(data.table)    
dat <- fread("id    startDate     endDate  someOtherCol
1   2017-09-23  2018-02-01          val1
2   2018-01-01  2018-03-31          val2")
datecols <- c("startDate", "endDate")
dat[, (datecols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=datecols]

Upvotes: 2

Related Questions