Reputation: 815
I have one dataframe containing all values (data) collected over time from different users (df). I have a separate dataframe (df1) indicating the start/end date of different events. I would like to use the second dataframe to filter the values in the first (i.e., just keep dates between certain events). My question is identical to this question - Take dates from one dataframe and filter data in another dataframe. However, the two answers there don't work for me. R can't allocate enough memory size to the left_join solution (I have a very large dataframe). The data.table answer almost works, but rather than pasting all dates in the output it only pastes the start date (but seems to work besides this - see below). Any fixes on either solution would be much appreciated. I've kept the same example code as in the initial question for consistency....
user=c(rep('A',7),rep('B',8))
data = seq(1:15)
date = as.Date(c('2016-01-01','2016-01-02','2016-01-03','2016-01-04','2016-01-05','2016-01-06','2016-01-07','2016-01-08','2016-01-09','2016-01-10','2016-01-11','2016-01-12','2016-01-13','2016-01-14','2016-01-15'))
df = data.frame(user,date,data)
df1 =data.frame(user = c('A','B'), start_date = as.Date(c('2016-01-02','2016-01-10')), end_date = as.Date(c('2016-01-06','2016-01-14')))
Dplyr solution (doesn't work on my data due to memory size):
library(dplyr)
df<-left_join(df,df1,by="user")
df <- df %>% filter(date>=start_date & date<=end_date)
Data.table solution (almost works, but just pastes the start date and not all original dates):
require(data.table)
setDT(df)[df1, .(user,date,data), on=.(user, date>=start_date, date<=end_date)]
Upvotes: 0
Views: 82
Reputation: 160437
This looks to be a non-equi/range join.
library(dplyr)
inner_join(df, df1, join_by(user, between(date, start_date, end_date)))
# user date data start_date end_date
# 1 A 2016-01-02 2 2016-01-02 2016-01-06
# 2 A 2016-01-03 3 2016-01-02 2016-01-06
# 3 A 2016-01-04 4 2016-01-02 2016-01-06
# 4 A 2016-01-05 5 2016-01-02 2016-01-06
# 5 A 2016-01-06 6 2016-01-02 2016-01-06
# 6 B 2016-01-10 10 2016-01-10 2016-01-14
# 7 B 2016-01-11 11 2016-01-10 2016-01-14
# 8 B 2016-01-12 12 2016-01-10 2016-01-14
# 9 B 2016-01-13 13 2016-01-10 2016-01-14
# 10 B 2016-01-14 14 2016-01-10 2016-01-14
library(data.table)
DT <- as.data.table(df) # setDT is canonical
DT1 <- as.data.table(df1)
DT[, d := date
][DT1, on = .(user, d >= start_date, d <= end_date)
][, c("d", "d.1") := NULL][]
# user date data
# <char> <Date> <int>
# 1: A 2016-01-02 2
# 2: A 2016-01-03 3
# 3: A 2016-01-04 4
# 4: A 2016-01-05 5
# 5: A 2016-01-06 6
# 6: B 2016-01-10 10
# 7: B 2016-01-11 11
# 8: B 2016-01-12 12
# 9: B 2016-01-13 13
# 10: B 2016-01-14 14
We need to take a little more care here since the default is to rename/overwrite some of the join values; frustrating perhaps, but "this is the way".
Demo without the extra steps:
DT[DT1, on = .(user, date >= start_date, date <= end_date)]
# user date data date.1
# <char> <Date> <int> <Date>
# 1: A 2016-01-02 2 2016-01-06
# 2: A 2016-01-02 3 2016-01-06
# 3: A 2016-01-02 4 2016-01-06
# 4: A 2016-01-02 5 2016-01-06
# 5: A 2016-01-02 6 2016-01-06
# 6: B 2016-01-10 10 2016-01-14
# 7: B 2016-01-10 11 2016-01-14
# 8: B 2016-01-10 12 2016-01-14
# 9: B 2016-01-10 13 2016-01-14
# 10: B 2016-01-10 14 2016-01-14
As you can see, we have lost the original DT$date
values, and the values from start_date
and end_date
have been renamed to d
and d.1
(the column name I joined on).
Upvotes: 1
Reputation: 834
I am a tidyverse user. I don't have experice with memory size problems.
My solution include:
I am also curious about the results of Clemsang's solution; if group_by() will solve the memory problem.
library(dplyr)
library(lubridate)
library(purrr)
df2 = df1 %>% mutate(interval = interval(start =start_date, end = end_date)) %>%
select(user, interval)
df %>% split(.$user) %>%
map( \(user_df) user_df %>% left_join(df2) %>%
filter (date %within% interval ) ) %>%
list_rbind(names_to = "user")
Upvotes: 1
Reputation: 5481
Here is one other way using dplyr
. Dates are compared per group to replace left_join
.
df %>%
group_by(user) %>%
filter(date >= df1$start_date[df1$user == first(user)]) %>%
filter(date <= df1$end_date[df1$user == first(user)]) %>%
ungroup()
Please tell us if execution time gets better.
Upvotes: 1