Karmic Dreamwork
Karmic Dreamwork

Reputation: 51

R - Find x days from start date while keeping dates inbetween

I am trying to find the first date of each category then subtract 5 days AND I want to keep the days inbetween! this is where I am struggling. I tried seq() but it gave me an error, so I'm not sure if this is the right way to do it.

I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date!

The error I got is this (for the commented out part of the code): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1

Any help would be greatly appreciated!

library ("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08",
                            "2020-06-09",
                            "2020-06-10",
                            "2020-06-11",
                            "2020-06-12",
                            "2021-07-13",    
                            "2021-07-14",
                            "2021-07-15",
                            "2021-08-16",
                            "2021-08-17",
                            "2021-08-18",
                            "2021-09-19",
                            "2021-09-20"),
                   value = c(2,1,7,1,0,1,2,3,4,7,6,5,10),
                   category = c(1,1,1,1,1,2,2,2,3,3,3,4,4))
data$date <- as.Date(data$date)                      
View(data)

enter image description here

test_dates <- data %>%
  group_by(category) %>%
  arrange(date) %>%
  slice(1L) %>% #takes first date
  mutate(first_day = as.Date(date) - 5)#%>%
  #seq(as.Date(first_day),by="day",length.out=5)

#error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
head(test_dates)

enter image description here

The answer I'm looking for should include these dates but in a column format! I'm also trying to input NA in the value category if the value doesnt already exist. I want to keep all possible columns, as the dataframe I'm needing to use this on has about 20 columns

Dates: "2020-06-03 ", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2020-07-08 ", "2020-07-09", "2020-07-10", "2020-07-11", "2020-07-12", "2021-07-13", "2020-08-11 ", "2020-08-12", "2020-08-13", "2020-08-14", "2020-08-15", "2021-08-16", "2020-09-14 ", "2020-09-15", "2020-09-16", "2020-09-17", "2020-09-18", "2021-09-19",

Related question here: How do I subset my df for the minimum date based on one category and including x days before that?

Upvotes: 1

Views: 99

Answers (3)

langtang
langtang

Reputation: 24845

This approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA for these rows)

library(data.table)
setDT(data)[data[, .(date=seq(min(date)-5,by="day", length.out=6)), category], on=.(category,date)]

Output:

          date value category
 1: 2020-06-03    NA        1
 2: 2020-06-04    NA        1
 3: 2020-06-05    NA        1
 4: 2020-06-06    NA        1
 5: 2020-06-07    NA        1
 6: 2020-06-08     2        1
 7: 2021-07-08    NA        2
 8: 2021-07-09    NA        2
 9: 2021-07-10    NA        2
10: 2021-07-11    NA        2
11: 2021-07-12    NA        2
12: 2021-07-13     1        2
13: 2021-08-11    NA        3
14: 2021-08-12    NA        3
15: 2021-08-13    NA        3
16: 2021-08-14    NA        3
17: 2021-08-15    NA        3
18: 2021-08-16     4        3
19: 2021-09-14    NA        4
20: 2021-09-15    NA        4
21: 2021-09-16    NA        4
22: 2021-09-17    NA        4
23: 2021-09-18    NA        4
24: 2021-09-19     5        4
          date value category

Note: The above uses a join; an identical result can be achieved without a join by row-binding the first row for each category with the data.table generated similarly as above:

rbind(
  setDT(data)[order(date), .SD[1],category],
  data[,.(date=seq(min(date)-5,by="day",length.out=5),value=NA),category]
)

You indicate you have many columns, so if you are going to take this second approach, rather than explicitly setting value=NA in the second input to rbind, you can also just leave it out, and add fill=TRUE within the rbind()

A dplyr version of the same is:

bind_rows(
  data %>% 
    group_by(category) %>%
    slice_min(date) %>%
    ungroup() %>%
    mutate(date=as.Date(date)),
  data %>%
    group_by(category) %>%
    summarize(date=seq(min(as.Date(date))-5,by="day", length.out=5), .groups="drop")
)

Output:

# A tibble: 24 x 3
   date       value category
   <date>     <dbl>    <dbl>
 1 2020-06-08     2        1
 2 2021-07-13     1        2
 3 2021-08-16     4        3
 4 2021-09-19     5        4
 5 2020-06-03    NA        1
 6 2020-06-04    NA        1
 7 2020-06-05    NA        1
 8 2020-06-06    NA        1
 9 2020-06-07    NA        1
10 2021-07-08    NA        2
# ... with 14 more rows

Update (9/21/22) -

If you want the NA values to be filled, simply add this to the end of either data.table pipeline:

 ...[,value:=max(value, na.rm=T), category]

or add this to the dplyr pipeline

... %>%
  group_by(category) %>%
  mutate(value=max(value, na.rm=T))

Upvotes: 2

Jon Spring
Jon Spring

Reputation: 66935

Here's one approach but kinda clunky:

bind_rows(
  data,
  data %>%
    group_by(category) %>%
    slice_min(date) %>%
    uncount(6, .id = "id") %>%
    mutate(date = date - id + 1) %>%
    select(-id)) %>%
  arrange(category, date)

Result

# A tibble: 37 × 3
   date       value category
   <date>     <dbl>    <dbl>
 1 2020-06-03     2        1
 2 2020-06-04     2        1
 3 2020-06-05     2        1
 4 2020-06-06     2        1
 5 2020-06-07     2        1
 6 2020-06-08     2        1
 7 2020-06-08     2        1
 8 2020-06-09     1        1
 9 2020-06-10     7        1
10 2020-06-11     1        1
# … with 27 more rows

Upvotes: 3

TarJae
TarJae

Reputation: 79224

@Jon Srpings answer fired this alternative approach:

Here we first get the first days - 5 as already presented in the question. Then we use bind_rows as Jon Srping does in his answer. Next step is to identify the original first dates within the dates column (we use !duplicated within filter). Last main step is to use coalesce:

library(lubridate)
library(dplyr)

data %>%
  group_by(category) %>%
  mutate(x = min(ymd(date))-5) %>% 
  slice(1) %>% 
  bind_rows(data) %>% 
  mutate(date = ymd(date)) %>% 
  filter(!duplicated(date)) %>% 
  mutate(x = coalesce(x, date)) %>% 
  arrange(category) %>% 
  select(date = x, value)
   category date       value
      <dbl> <date>     <dbl>
 1        1 2020-06-03     2
 2        1 2020-06-09     1
 3        1 2020-06-10     7
 4        1 2020-06-11     1
 5        1 2020-06-12     0
 6        2 2021-07-08     1
 7        2 2021-07-14     2
 8        2 2021-07-15     3
 9        3 2021-08-11     4
10        3 2021-08-17     7
11        3 2021-08-18     6
12        4 2021-09-14     5
13        4 2021-09-20    10

Upvotes: 2

Related Questions