Reputation: 2413
I am trying to match two datasets by nearest preceding date, by group. So within a group, I would like to add the variables of a second dataset (d2) to that of the first (d1) when the date of the first is the nearest date on or before the date in the second. If two rows in the second dataset are matched with one row in the first I would like to add the larger of the values. (there will always be at least one date in d1 less then the date in d2, by group)
Here is an example, which hopefully makes it clearer
d1 = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )))
d1
# id ref
# 1 1 2013-12-07
# 2 1 2014-12-07
# 3 1 2015-12-07
# 4 2 2013-11-07
# 5 2 2014-11-07
d2 = data.frame(id=c(1,1,2),
date=as.Date(c("2014-05-07","2014-12-05", "2015-11-05")),
x1 = factor(c(1,2,2), ordered = TRUE),
x2 = factor(c(2, NA ,2), ordered=TRUE))
d2
# id date x1 x2
# 1 1 2014-05-07 1 2
# 2 1 2014-12-05 2 <NA>
# 3 2 2015-11-05 2 2
With the expected outcome
output = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )),
x1 = c(2, NA, NA, NA, 2),
x2 = c(2, NA, NA, NA, 2))
output
# id ref x1 x2
# 1 1 2013-12-07 2 2
# 2 1 2014-12-07 NA NA
# 3 1 2015-12-07 NA NA
# 4 2 2013-11-07 NA NA
# 5 2 2014-11-07 2 2
So for example, the first two observations of d2, id=1, with dates "2014-05-07","2014-12-05"
, are matched to the earlier date "2013-12-07"
in d1. As there are two rows matched to one row in d1,
then the highest level is selected.
I could do this in base R by looping the following calculations through each group but I was hoping for something more efficient. I would love to see a data.table approach (but I am limited to R v3.1 and data.table v1.9.4). Thanks
real dataset:
d1: rows 1M / 100K groups
d2: rows 11K / 4K groups
# for one group
x = d1[d1$id==1, ]
y = d2[d2$id==1, ]
id = apply(outer(x$ref, y$date, "-"), 2, which.min)
temp = cbind(y, ref=x$ref[id])
# aggregate variables by ref
temp = merge(aggregate(x1 ~ ref, data=temp, max),
aggregate(x2 ~ ref, data=temp, max)
)
merge(x, temp, all=T)
ps: I had looked at How to match by nearest date from two data frames? and Join data.table on exact date or if not the case on the nearest less than date with no success.
Upvotes: 4
Views: 1135
Reputation: 7435
You can do this using dplyr
:
d2$ind <- 0
library(dplyr)
out <- d1 %>% full_join(d2,by=c("id","ref"="date")) %>%
arrange(id,ref) %>%
mutate(ind=cumsum(ifelse(is.na(ind),1,ind))) %>%
group_by(ind) %>%
summarise(ref=min(ref),x1=max(x1,na.rm=TRUE),x2=max(x2,na.rm=TRUE))
### A tibble: 5 x 4
## ind ref x1 x2
## <dbl> <date> <fctr> <fctr>
##1 1 2013-12-07 2 2
##2 2 2014-12-07 NA NA
##3 3 2015-12-07 NA NA
##4 4 2013-11-07 NA NA
##5 5 2014-11-07 2 2
We first add a column of indicators to d2
and set those to zero. Then, we perform a full outer join between d1
and d2
. Those rows in d1
will have ind
of NA
. We sort by id
and ref
(i.e., the date), and we replace the NA
entries of ind
with 1
and perform a cumsum
. This results in:
id ref x1 x2 ind
1 1 2013-12-07 <NA> <NA> 1
2 1 2014-05-07 1 2 1
3 1 2014-12-05 2 <NA> 1
4 1 2014-12-07 <NA> <NA> 2
5 1 2015-12-07 <NA> <NA> 3
6 2 2013-11-07 <NA> <NA> 4
7 2 2014-11-07 <NA> <NA> 5
8 2 2015-11-05 2 2 5
From this we can easily see that we can group by ind
and summarise
appropriately to get your result.
Upvotes: 1