Prateek
Prateek

Reputation: 61

Find all rows from one dataframe based on date looked into the date range in other dataframe in R

I have 2 data frames as mentioned below:

df1 <- data.frame(ID=c(1,1,1,2,3,4,4), actual.date=c('10/01/1996','10/02/1996','5/01/2002','7/01/1999','9/01/2005','5/01/2006','2/03/2003'),
val=c(5,10,15,20,25,30,35))
dcis <- grep('date$',names(df1))
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y')
df1

df2 <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004'))
dcis <- grep('date$',names(df2))
df2[dcis] <- lapply(df2[dcis],as.Date,'%m/%d/%Y')
df2

Requirement -> I will start with each row of df2 and see how many rows in df1 lies within (inclusive range) date range specified in that row of df2 (grouped on ID).

For ex: for 1st row in df2, there are 2 rows in df1 (1st and 2nd) which has ID=1 and lies in date range of 1st row of df2. In the final output, I will sum 5+10 (from column 'val') and put against 1st row of df2.

Upvotes: 1

Views: 57

Answers (3)

akrun
akrun

Reputation: 887951

We can use a non-equi join in data.table

library(data.table)
val1 <- setDT(df1)[df2, sum(val), on = .(ID, actual.date >= before.date, 
           actual.date <= after.date), by = .EACHI]$V1
df2$val <- val1
df2$val
#[1] 15 15 15 NA 25 35 NA NA

Or another option is a fuzzy_join

library(fuzzyjoin)
library(dplyr)
fuzzy_right_join(df1, df2, by = c("ID",
    "actual.date" = "before.date", "actual.date"= "after.date"), 
      match_fun = list(`==`, `>=`, `<=`)) %>%
 group_by(ID = ID.y, before.date, after.date) %>% 
 summarise(val = sum(val, na.rm = TRUE))
# A tibble: 8 x 4
# Groups:   ID, before.date [8]
#     ID before.date after.date   val
#  <dbl> <date>      <date>     <dbl>
#1     1 1996-10-01  1996-12-01    15
#2     1 1998-01-01  2003-09-30    15
#3     1 2000-01-01  2004-12-31    15
#4     2 2001-01-01  2006-03-31     0
#5     3 2001-01-01  2006-09-30    25
#6     4 2001-01-01  2005-09-30    35
#7     4 2004-10-01  2004-12-30     0
#8     4 2004-10-03  2004-11-28     0

Upvotes: 1

Bruno
Bruno

Reputation: 4150

Maybe this a duplicate post

dplyr left_join by less than, greater than condition

df1 %>% 
  mutate(dummy=TRUE) %>%
  left_join(df2 %>% mutate(dummy=TRUE)) %>%
  filter(actual.date >= before.date, actual.date < after.date) %>%
  select(-dummy)

We also have fuzzyjoin.

library(fuzzyjoin)

fuzzy_inner_join(df1,df2,
                  by = c("actual.date" = "before.date","actual.date" = "after.date","ID" = "ID"),
                  match_fun = list(`>=`, `<=`,`==`))

Upvotes: 0

ThomasIsCoding
ThomasIsCoding

Reputation: 102810

I am not sure if this is what your are after. Here is a base R solution

df2$res <- apply(df2,1, function(x) sum(df1$val[df1$ID == x["ID"] 
                                                & df1$actual.date>= x["before.date"] 
                                                & df1$actual.date<= x["after.date"]]))

such that

> df2
  ID before.date after.date res
1  1  1996-10-01 1996-12-01  15
2  1  1998-01-01 2003-09-30  15
3  1  2000-01-01 2004-12-31  15
4  2  2001-01-01 2006-03-31   0
5  3  2001-01-01 2006-09-30  25
6  4  2001-01-01 2005-09-30  35
7  4  2004-10-01 2004-12-30   0
8  4  2004-10-03 2004-11-28   0

Upvotes: 0

Related Questions