Reputation: 359
I currently have two datasets. The first one has weekly observations for each ID (the number of weeks observed may differ across IDs, but all fall within a fixed range, the year 2022):
ID | Date | Var1 |
---|---|---|
1 | 2022-07-04 | 200 |
1 | 2022-07-11 | 45 |
1 | 2022-07-18 | 70 |
1 | 2022-08-01 | 90 |
2 | 2022-06-27 | 50 |
2 | 2022-07-18 | 70 |
2 | 2022-08-01 | 90 |
3 | 2022-12-01 | 70 |
3 | 2022-12-03 | 90 |
The second one has the same IDs, but with values recorded at no fixed interval (i.e. observations can fall on any date, and the number per ID varies):
ID | Date | Var2 | Var3 |
---|---|---|---|
1 | 2010-12-06 | 200 | 0.5 |
1 | 2022-07-29 | 300 | 0.6 |
2 | 2015-06-27 | 100 | 0.1 |
2 | 2016-07-04 | 150 | 0.1 |
2 | 2016-07-05 | 200 | 0.4 |
2 | 2018-09-21 | 600 | 0.0 |
My goal is to include in the first dataset summaries of Var2 and Var3 from the second dataset UP TO the Date in each row (i.e. the average of all values of Var2 and Var3 recorded prior to that Date). The desired output would be:
ID | Date | Var1 | Mean_Var2 | Mean_Var3 | N |
---|---|---|---|---|---|
1 | 2022-07-04 | 200 | 200 | 0.5 | 1 |
1 | 2022-07-11 | 45 | 200 | 0.5 | 1 |
1 | 2022-07-18 | 70 | 200 | 0.5 | 1 |
1 | 2022-08-01 | 90 | 250 | 0.55 | 2 |
2 | 2022-06-27 | 50 | 262.5 | 0.15 | 4 |
2 | 2022-07-18 | 70 | 262.5 | 0.15 | 4 |
2 | 2022-08-01 | 90 | 262.5 | 0.15 | 4 |
3 | 2022-12-01 | 70 | NA | NA | 0 |
3 | 2022-12-03 | 90 | NA | NA | 0 |
where N contains a count of the values that were averaged.
df1 <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
Date = as.Date(c('2022-07-04', '2022-07-11', '2022-07-18', '2022-08-01',
'2022-06-27', '2022-07-18', '2022-08-01',
'2022-12-01', '2022-12-03')),
Var1 = c(200, 45, 70, 90, 50, 70, 90, 70, 90))
df2 <- data.frame(ID = c(1, 1, 2, 2, 2, 2),
Date = as.Date(c('2010-12-06', '2022-07-29',
'2015-06-27', '2016-07-04', '2016-07-05', '2018-09-21')),
Var2 = c(200, 300, 100, 150, 200, 600),
Var3 = c(0.5, 0.6, 0.1, 0.1, 0.4, 0.0))
I am a bit stumped as to how to go about this. I was thinking of just iterating by row across the first dataset and constructing the measure for each row (roughly the sketch below), but the datasets themselves are quite large (> 1 million obs each), so this seems inefficient. Does anyone know a faster way?
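For reference, the row-by-row version I had in mind looks something like this (it reproduces the desired output on the small example, but a loop over a million rows of df1 seems far too slow):
# Naive row-by-row approach: for each row of df1, subset df2 to the same
# ID with an earlier Date, then average Var2/Var3 over that subset.
res <- df1
res$Mean_Var2 <- NA_real_
res$Mean_Var3 <- NA_real_
res$N <- 0L
for (i in seq_len(nrow(df1))) {
  prior <- df2[df2$ID == df1$ID[i] & df2$Date < df1$Date[i], ]
  if (nrow(prior) > 0) {
    res$Mean_Var2[i] <- mean(prior$Var2)
    res$Mean_Var3[i] <- mean(prior$Var3)
    res$N[i] <- nrow(prior)
  }
}
res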
Upvotes: 1
Views: 53
Reputation: 20512
You want to join by ID in all cases where the Date in df2 is less than the Date in df1. Sounds like a case for the data.table non-equi join!
library(data.table)
setDT(df1)
setDT(df2)

# copy the join column so that df2's Date isn't lost in the non-equi join
df2[, date_for_join_df2 := Date]

# for every row of df1, pull the df2 rows with the same ID and an earlier
# Date, then summarise per df1 row (i.e. per ID/Date pair)
df2[df1,
  on = .(
    ID,
    date_for_join_df2 < Date
  )
][, .(
  ID,
  Date = date_for_join_df2,
  Mean_Var2 = mean(Var2),
  Mean_Var3 = mean(Var3),
  .N
)
, by = .(ID, date_for_join_df2)
]
# ID date_for_join_df2 ID Date Mean_Var2 Mean_Var3 N
# 1: 1 2022-07-04 1 2022-07-04 200.0 0.50 1
# 2: 1 2022-07-11 1 2022-07-11 200.0 0.50 1
# 3: 1 2022-07-18 1 2022-07-18 200.0 0.50 1
# 4: 1 2022-08-01 1 2022-08-01 250.0 0.55 2
# 5: 2 2022-06-27 2 2022-06-27 262.5 0.15 4
# 6: 2 2022-07-18 2 2022-07-18 262.5 0.15 4
# 7: 2 2022-08-01 2 2022-08-01 262.5 0.15 4
# 8: 3 2022-12-01 3 2022-12-01 NA NA 1
# 9: 3 2022-12-03 3 2022-12-03 NA NA 1
Thanks to the magic of data.table this should be quite fast. Note the df2[, date_for_join_df2 := Date] step: it's important to copy the join column and join on the copy, because a join column takes the values it was joined against - which is harmless in an equi join (the values are identical), but in a non-equi join it means the original values are lost.
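To see this concretely, here is the join step on its own (the same call as above, just not chained into the summary):
# Intermediate non-equi join: in the result, date_for_join_df2 holds df1's
# Date (the value it was joined against), while df2's original Date column
# is still there because it was not used as a join column.
joined <- df2[df1, on = .(ID, date_for_join_df2 < Date)]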
I linked to the data.table vignette above. This blog post by David Selby contains a more in-depth explanation and examples.
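One caveat: in the output above, N is 1 rather than 0 for ID 3, because .N counts the joined rows even when nothing matched, whereas the desired output in the question has 0 there. If that matters, a small variant along the same lines counts only the non-missing values:
# Count only rows that actually matched, so N is 0 when an ID has no
# earlier observations in df2 (the means stay NA for those rows).
df2[df1, on = .(ID, date_for_join_df2 < Date)
    ][, .(Mean_Var2 = mean(Var2),
          Mean_Var3 = mean(Var3),
          N = sum(!is.na(Var2))),
      by = .(ID, Date = date_for_join_df2)]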
Upvotes: 1