Reputation: 51
I am trying to find the first date of each category then subtract 5 days AND I want to keep the days inbetween! this is where I am struggling. I tried seq() but it gave me an error, so I'm not sure if this is the right way to do it.
I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date!
The error I got is this (for the commented out part of the code): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
Any help would be greatly appreciated!
library ("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08",
"2020-06-09",
"2020-06-10",
"2020-06-11",
"2020-06-12",
"2021-07-13",
"2021-07-14",
"2021-07-15",
"2021-08-16",
"2021-08-17",
"2021-08-18",
"2021-09-19",
"2021-09-20"),
value = c(2,1,7,1,0,1,2,3,4,7,6,5,10),
category = c(1,1,1,1,1,2,2,2,3,3,3,4,4))
data$date <- as.Date(data$date)
View(data)
test_dates <- data %>%
group_by(category) %>%
arrange(date) %>%
slice(1L) %>% #takes first date
mutate(first_day = as.Date(date) - 5)#%>%
#seq(as.Date(first_day),by="day",length.out=5)
#error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
head(test_dates)
The answer I'm looking for should include these dates but in a column format! I'm also trying to input NA in the value category if the value doesnt already exist. I want to keep all possible columns, as the dataframe I'm needing to use this on has about 20 columns
Dates: "2020-06-03 ", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2020-07-08 ", "2020-07-09", "2020-07-10", "2020-07-11", "2020-07-12", "2021-07-13", "2020-08-11 ", "2020-08-12", "2020-08-13", "2020-08-14", "2020-08-15", "2021-08-16", "2020-09-14 ", "2020-09-15", "2020-09-16", "2020-09-17", "2020-09-18", "2021-09-19",
Related question here: How do I subset my df for the minimum date based on one category and including x days before that?
Upvotes: 1
Views: 99
Reputation: 24845
This approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA
for these rows)
library(data.table)
setDT(data)[data[, .(date=seq(min(date)-5,by="day", length.out=6)), category], on=.(category,date)]
Output:
date value category
1: 2020-06-03 NA 1
2: 2020-06-04 NA 1
3: 2020-06-05 NA 1
4: 2020-06-06 NA 1
5: 2020-06-07 NA 1
6: 2020-06-08 2 1
7: 2021-07-08 NA 2
8: 2021-07-09 NA 2
9: 2021-07-10 NA 2
10: 2021-07-11 NA 2
11: 2021-07-12 NA 2
12: 2021-07-13 1 2
13: 2021-08-11 NA 3
14: 2021-08-12 NA 3
15: 2021-08-13 NA 3
16: 2021-08-14 NA 3
17: 2021-08-15 NA 3
18: 2021-08-16 4 3
19: 2021-09-14 NA 4
20: 2021-09-15 NA 4
21: 2021-09-16 NA 4
22: 2021-09-17 NA 4
23: 2021-09-18 NA 4
24: 2021-09-19 5 4
date value category
Note: The above uses a join; an identical result can be achieved without a join by row-binding the first
row for each category with the data.table generated similarly as above:
rbind(
setDT(data)[order(date), .SD[1],category],
data[,.(date=seq(min(date)-5,by="day",length.out=5),value=NA),category]
)
You indicate you have many columns, so if you are going to take this second approach, rather than explicitly setting value=NA
in the second input to rbind
, you can also just leave it out, and add fill=TRUE
within the rbind()
A dplyr
version of the same is:
bind_rows(
data %>%
group_by(category) %>%
slice_min(date) %>%
ungroup() %>%
mutate(date=as.Date(date)),
data %>%
group_by(category) %>%
summarize(date=seq(min(as.Date(date))-5,by="day", length.out=5), .groups="drop")
)
Output:
# A tibble: 24 x 3
date value category
<date> <dbl> <dbl>
1 2020-06-08 2 1
2 2021-07-13 1 2
3 2021-08-16 4 3
4 2021-09-19 5 4
5 2020-06-03 NA 1
6 2020-06-04 NA 1
7 2020-06-05 NA 1
8 2020-06-06 NA 1
9 2020-06-07 NA 1
10 2021-07-08 NA 2
# ... with 14 more rows
If you want the NA values to be filled, simply add this to the end of either data.table
pipeline:
...[,value:=max(value, na.rm=T), category]
or add this to the dplyr pipeline
... %>%
group_by(category) %>%
mutate(value=max(value, na.rm=T))
Upvotes: 2
Reputation: 66935
Here's one approach but kinda clunky:
bind_rows(
data,
data %>%
group_by(category) %>%
slice_min(date) %>%
uncount(6, .id = "id") %>%
mutate(date = date - id + 1) %>%
select(-id)) %>%
arrange(category, date)
Result
# A tibble: 37 × 3
date value category
<date> <dbl> <dbl>
1 2020-06-03 2 1
2 2020-06-04 2 1
3 2020-06-05 2 1
4 2020-06-06 2 1
5 2020-06-07 2 1
6 2020-06-08 2 1
7 2020-06-08 2 1
8 2020-06-09 1 1
9 2020-06-10 7 1
10 2020-06-11 1 1
# … with 27 more rows
Upvotes: 3
Reputation: 79224
@Jon Srpings answer fired this alternative approach:
Here we first get the first days - 5 as already presented in the question. Then we use bind_rows as Jon Srping does in his answer. Next step is to identify the original first dates within the dates column (we use !duplicated
within filter
). Last main step is to use coalesce
:
library(lubridate)
library(dplyr)
data %>%
group_by(category) %>%
mutate(x = min(ymd(date))-5) %>%
slice(1) %>%
bind_rows(data) %>%
mutate(date = ymd(date)) %>%
filter(!duplicated(date)) %>%
mutate(x = coalesce(x, date)) %>%
arrange(category) %>%
select(date = x, value)
category date value
<dbl> <date> <dbl>
1 1 2020-06-03 2
2 1 2020-06-09 1
3 1 2020-06-10 7
4 1 2020-06-11 1
5 1 2020-06-12 0
6 2 2021-07-08 1
7 2 2021-07-14 2
8 2 2021-07-15 3
9 3 2021-08-11 4
10 3 2021-08-17 7
11 3 2021-08-18 6
12 4 2021-09-14 5
13 4 2021-09-20 10
Upvotes: 2