rahul
rahul

Reputation: 147

How to assign day of year values starting from an arbitary date and take care of missing values?

I have an R dataframe df_demand with a date column (depdate) and a dependent variable column bookings. The duration is 365 days starting from 2017-11-02 and ending at 2018-11-01, sorted in ascending order.

We have booking data for only 279 days in the year.

dplyr::arrange(df_demand, depdate)

           depdate bookings
    1   2017-11-02       43
    2   2017-11-03       27
    3   2017-11-05       27
    4   2017-11-06       22
    5   2017-11-07       39
    6   2017-11-08       48
    .
    .

   279  2018-11-01       60

I want to introduce another column day_of_year in the following way:

    depdate       day_of_year     bookings
1    2017-11-02        1              43
2    2017-11-03        2              27
3    2017-11-04        3              NA
4    2017-11-05        4              27
    .
    .
    .
365  2018-11-01      365              60

I am trying to find the best possible way to do this.

In Python, I could use something like :

df_demand['day_of_year'] = df_demand['depdate'].sub(df_demand['depdate'].iat[0]).dt.days + 1

I wanted to know about an R equivalent of the same.

When I run

typeof(df_demand_2$depdate)

the output is

"double"

Am I missing something?

enter image description here

Upvotes: 1

Views: 77

Answers (1)

A. Stam
A. Stam

Reputation: 2222

You can create a row for every date using the complete function from the tidyr package.

First, I'm creating a data frame with some sample data:

df <- data.frame(
  depdate = as.Date(c('2017-11-02', '2017-11-03', '2017-11-05')),
  bookings = c(43, 27, 27)
)

Next, I'm performing two operations. First, using tidyr::complete, I'm specifying all the dates I want in my analysis. I can do that using seq.Date, creating a sequence from the first to the last day.

Once that is done, the day_of_year column is simply equal to the row number.

df_complete <- tidyr::complete(df,
  depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)
)

df_complete$day_of_year <- 1:nrow(df_complete)

> df_complete
#> # A tibble: 4 x 3
#>   depdate    bookings day_of_year
#>   <date>        <dbl>       <int>
#> 1 2017-11-02       43           1
#> 2 2017-11-03       27           2
#> 3 2017-11-04       NA           3
#> 4 2017-11-05       27           4

An equivalent solution with the pipe operator from dplyr:

df %>%
  complete(depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)) %>%
  mutate(days_of_year = row_number())

Upvotes: 2

Related Questions