James White
James White

Reputation: 815

Using dplyr or data.table to use dates from one dataframe to filter a separate dataframe

I have one dataframe containing all values (data) collected over time from different users (df). I have a separate dataframe (df1) indicating the start/end date of different events. I would like to use the second dataframe to filter the values in the first (i.e., just keep dates between certain events). My question is identical to this question - Take dates from one dataframe and filter data in another dataframe. However, the two answers there don't work for me. R can't allocate enough memory size to the left_join solution (I have a very large dataframe). The data.table answer almost works, but rather than pasting all dates in the output it only pastes the start date (but seems to work besides this - see below). Any fixes on either solution would be much appreciated. I've kept the same example code as in the initial question for consistency....

user=c(rep('A',7),rep('B',8))
data = seq(1:15)
date = as.Date(c('2016-01-01','2016-01-02','2016-01-03','2016-01-04','2016-01-05','2016-01-06','2016-01-07','2016-01-08','2016-01-09','2016-01-10','2016-01-11','2016-01-12','2016-01-13','2016-01-14','2016-01-15'))
df = data.frame(user,date,data)

df1 =data.frame(user = c('A','B'), start_date = as.Date(c('2016-01-02','2016-01-10')),  end_date = as.Date(c('2016-01-06','2016-01-14')))

Dplyr solution (doesn't work on my data due to memory size):

    library(dplyr)
    df<-left_join(df,df1,by="user")
    df <- df %>% filter(date>=start_date & date<=end_date)

Data.table solution (almost works, but just pastes the start date and not all original dates):

require(data.table)
setDT(df)[df1, .(user,date,data), on=.(user, date>=start_date, date<=end_date)]

Upvotes: 0

Views: 82

Answers (3)

r2evans
r2evans

Reputation: 160437

This looks to be a non-equi/range join.

dplyr

library(dplyr)
inner_join(df, df1, join_by(user, between(date, start_date, end_date)))
#    user       date data start_date   end_date
# 1     A 2016-01-02    2 2016-01-02 2016-01-06
# 2     A 2016-01-03    3 2016-01-02 2016-01-06
# 3     A 2016-01-04    4 2016-01-02 2016-01-06
# 4     A 2016-01-05    5 2016-01-02 2016-01-06
# 5     A 2016-01-06    6 2016-01-02 2016-01-06
# 6     B 2016-01-10   10 2016-01-10 2016-01-14
# 7     B 2016-01-11   11 2016-01-10 2016-01-14
# 8     B 2016-01-12   12 2016-01-10 2016-01-14
# 9     B 2016-01-13   13 2016-01-10 2016-01-14
# 10    B 2016-01-14   14 2016-01-10 2016-01-14

data.table

library(data.table)
DT <- as.data.table(df) # setDT is canonical
DT1 <- as.data.table(df1)
DT[, d := date
  ][DT1, on = .(user, d >= start_date, d <= end_date)
  ][, c("d", "d.1") := NULL][]
#       user       date  data
#     <char>     <Date> <int>
#  1:      A 2016-01-02     2
#  2:      A 2016-01-03     3
#  3:      A 2016-01-04     4
#  4:      A 2016-01-05     5
#  5:      A 2016-01-06     6
#  6:      B 2016-01-10    10
#  7:      B 2016-01-11    11
#  8:      B 2016-01-12    12
#  9:      B 2016-01-13    13
# 10:      B 2016-01-14    14

We need to take a little more care here since the default is to rename/overwrite some of the join values; frustrating perhaps, but "this is the way".

Demo without the extra steps:

DT[DT1, on = .(user, date >= start_date, date <= end_date)]
#       user       date  data     date.1
#     <char>     <Date> <int>     <Date>
#  1:      A 2016-01-02     2 2016-01-06
#  2:      A 2016-01-02     3 2016-01-06
#  3:      A 2016-01-02     4 2016-01-06
#  4:      A 2016-01-02     5 2016-01-06
#  5:      A 2016-01-02     6 2016-01-06
#  6:      B 2016-01-10    10 2016-01-14
#  7:      B 2016-01-10    11 2016-01-14
#  8:      B 2016-01-10    12 2016-01-14
#  9:      B 2016-01-10    13 2016-01-14
# 10:      B 2016-01-10    14 2016-01-14

As you can see, we have lost the original DT$date values, and the values from start_date and end_date have been renamed to d and d.1 (the column name I joined on).

Upvotes: 1

VinceGreg
VinceGreg

Reputation: 834

I am a tidyverse user. I don't have experice with memory size problems.

My solution include:

  • (optionnal) Using lubridate intervals to verify if the date is within the specified dates.
  • Splitting the df per user using purrr::map(), and list_rbind after. I hope that will help the memory difficulty.

I am also curious about the results of Clemsang's solution; if group_by() will solve the memory problem.

library(dplyr) 
library(lubridate)
library(purrr)

df2 = df1 %>%  mutate(interval = interval(start =start_date, end = end_date)) %>% 
  select(user, interval)

df %>%  split(.$user) %>% 
  map( \(user_df) user_df %>% left_join(df2) %>% 
        filter (date %within% interval ) ) %>% 
  list_rbind(names_to = "user")

Upvotes: 1

Clemsang
Clemsang

Reputation: 5481

Here is one other way using dplyr. Dates are compared per group to replace left_join.

df %>%
  group_by(user) %>%
  filter(date >= df1$start_date[df1$user == first(user)]) %>%
  filter(date <= df1$end_date[df1$user == first(user)]) %>%
  ungroup()

Please tell us if execution time gets better.

Upvotes: 1

Related Questions