Reputation: 359
I currently have two datasets. The first one has weekly observations for each ID (the number of weeks observed may differ across IDs, but all fall within a fixed range, the year 2022):
ID | Date | Var1 |
---|---|---|
1 | 2022-07-04 | 200 |
1 | 2022-07-11 | 45 |
1 | 2022-07-18 | 70 |
1 | 2022-08-01 | 90 |
2 | 2022-06-27 | 50 |
2 | 2022-07-18 | 70 |
2 | 2022-08-01 | 90 |
3 | 2022-12-01 | 70 |
3 | 2022-12-03 | 90 |
The second one has the same IDs, but with values recorded at no fixed interval (i.e. observations can fall on any date, and the number per ID varies):
ID | Date | Var2 | Var3 |
---|---|---|---|
1 | 2010-12-06 | 200 | 0.5 |
1 | 2022-07-29 | 300 | 0.6 |
2 | 2015-06-27 | 100 | 0.1 |
2 | 2016-07-04 | 150 | 0.1 |
2 | 2016-07-05 | 200 | 0.4 |
2 | 2018-09-21 | 600 | 0.0 |
My goal is to include in the first dataset summaries of Var2 and Var3 from the second dataset UP TO the Date in each row (i.e. the average of all values of Var2 and Var3 recorded prior to that Date). The desired output would be:
ID | Date | Var1 | Mean_Var2 | Mean_Var3 | N |
---|---|---|---|---|---|
1 | 2022-07-04 | 200 | 200 | 0.5 | 1 |
1 | 2022-07-11 | 45 | 200 | 0.5 | 1 |
1 | 2022-07-18 | 70 | 200 | 0.5 | 1 |
1 | 2022-08-01 | 90 | 250 | 0.55 | 2 |
2 | 2022-06-27 | 50 | 262.5 | 0.15 | 4 |
2 | 2022-07-18 | 70 | 262.5 | 0.15 | 4 |
2 | 2022-08-01 | 90 | 262.5 | 0.15 | 4 |
3 | 2022-12-01 | 70 | NA | NA | 0 |
3 | 2022-12-03 | 90 | NA | NA | 0 |
where N contains a count of the values that were averaged.
df1 <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
Date = as.Date(c('2022-07-04', '2022-07-11', '2022-07-18', '2022-08-01',
'2022-06-27', '2022-07-18', '2022-08-01',
'2022-12-01', '2022-12-03')),
Var1 = c(200, 45, 70, 90, 50, 70, 90, 70, 90))
df2 <- data.frame(ID = c(1, 1, 2, 2, 2, 2),
Date = as.Date(c('2010-12-06', '2022-07-29',
'2015-06-27', '2016-07-04', '2016-07-05', '2018-09-21')),
Var2 = c(200, 300, 100, 150, 200, 600),
Var3 = c(0.5, 0.6, 0.1, 0.1, 0.4, 0.0))
I am a bit stumped as to how to go about this. I was thinking of just iterating by row across the first dataset and constructing the measure for each row (roughly the sketch below), but the datasets themselves are quite large (> 1 million obs each), so this seems inefficient. Does anyone know a faster way?
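For reference, the row-by-row version I had in mind looks something like this (it reproduces the desired output on the small example, but a loop over a million rows of df1 seems far too slow):
# Naive row-by-row approach: for each row of df1, subset df2 to the same
# ID with an earlier Date, then average Var2/Var3 over that subset.
res <- df1
res$Mean_Var2 <- NA_real_
res$Mean_Var3 <- NA_real_
res$N <- 0L
for (i in seq_len(nrow(df1))) {
  prior <- df2[df2$ID == df1$ID[i] & df2$Date < df1$Date[i], ]
  if (nrow(prior) > 0) {
    res$Mean_Var2[i] <- mean(prior$Var2)
    res$Mean_Var3[i] <- mean(prior$Var3)
    res$N[i] <- nrow(prior)
  }
}
res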
Upvotes: 1
Views: 53
Reputation: 20512
You want to join by ID in all cases where the Date in df2 is less than the Date in df1. Sounds like a case for the data.table non-equi join!
library(data.table)
setDT(df1)
setDT(df2)

# copy the join column so that df2's Date isn't lost in the non-equi join
df2[, date_for_join_df2 := Date]

# for every row of df1, pull the df2 rows with the same ID and an earlier
# Date, then summarise per df1 row (i.e. per ID/Date pair)
df2[df1,
  on = .(
    ID,
    date_for_join_df2 < Date
  )
][, .(
  ID,
  Date = date_for_join_df2,
  Mean_Var2 = mean(Var2),
  Mean_Var3 = mean(Var3),
  .N
)
, by = .(ID, date_for_join_df2)
]
# ID date_for_join_df2 ID Date Mean_Var2 Mean_Var3 N
# 1: 1 2022-07-04 1 2022-07-04 200.0 0.50 1
# 2: 1 2022-07-11 1 2022-07-11 200.0 0.50 1
# 3: 1 2022-07-18 1 2022-07-18 200.0 0.50 1
# 4: 1 2022-08-01 1 2022-08-01 250.0 0.55 2
# 5: 2 2022-06-27 2 2022-06-27 262.5 0.15 4
# 6: 2 2022-07-18 2 2022-07-18 262.5 0.15 4
# 7: 2 2022-08-01 2 2022-08-01 262.5 0.15 4
# 8: 3 2022-12-01 3 2022-12-01 NA NA 1
# 9: 3 2022-12-03 3 2022-12-03 NA NA 1
Thanks to the magic of data.table this should be quite fast. Note the df2[, date_for_join_df2 := Date] step: it's important to copy the join column and join on the copy, because a join column takes the values it was joined against - which is harmless in an equi join (the values are identical), but in a non-equi join it means the original values are lost.
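To see this concretely, here is the join step on its own (the same call as above, just not chained into the summary):
# Intermediate non-equi join: in the result, date_for_join_df2 holds df1's
# Date (the value it was joined against), while df2's original Date column
# is still there because it was not used as a join column.
joined <- df2[df1, on = .(ID, date_for_join_df2 < Date)]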
I linked to the data.table vignette above. This blog post by David Selby contains a more in-depth explanation and examples.
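One caveat: in the output above, N is 1 rather than 0 for ID 3, because .N counts the joined rows even when nothing matched, whereas the desired output in the question has 0 there. If that matters, a small variant along the same lines counts only the non-missing values:
# Count only rows that actually matched, so N is 0 when an ID has no
# earlier observations in df2 (the means stay NA for those rows).
df2[df1, on = .(ID, date_for_join_df2 < Date)
    ][, .(Mean_Var2 = mean(Var2),
          Mean_Var3 = mean(Var3),
          N = sum(!is.na(Var2))),
      by = .(ID, Date = date_for_join_df2)]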
Upvotes: 1