Reputation: 845
I have two data frames, and I am trying to filter out one of the data frames using the two columns from the other one. In my example, I am trying to filter out the data by both the year
and ID
column in df
using the df2
data frame. For example:
I would like to filter out all the dates that have ID
1 in 2010, ID
3 in 2012, ID
2 in 2010, and ID
4 in 2013 from the data frame df
. So that the output only has the dates from 2010 for ID
1, only 2012 for ID
3, 2010 for ID
2, and 2013 for ID
4 together in a new data frame.
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df2 <- data.frame(year = c(2010, 2012, 2010,2013),
ID = c(1,3,2,4))
Upvotes: 0
Views: 1753
Reputation: 16836
An alternative using base R is to use merge
.
library(lubridate)
df$year = lubridate::year(df$date)
merge(df, df2)
Since, df2
only has the two columns, then the merge is limited to using those columns. If we were to write out the columns that we want to merge by, then it would look something like this.
merge(df,df2[c("year","ID")],by=c("year","ID"))
Upvotes: 2
Reputation: 6206
Is this what you need?
Perhaps not the most elegant solution, but you can paste together the combinations of years and ID in both data.frames and then use one to filter the other. Probably not the best way if you have a large data.frame though.
df %>%
dplyr::filter(paste0(lubridate::year(date), "_", ID) %in% paste0(df2$year,"_", df2$ID))
date x y ID
1 2010-12-26 74119.46 839347.8 1
2 2010-12-27 72401.02 891788.1 2
3 2010-12-31 66940.94 810089.6 1
4 2012-01-02 68214.97 881200.1 3
5 2012-01-07 70595.92 863277.7 3
6 2012-01-12 79799.85 857738.5 3
7 2012-01-17 61102.50 848880.6 3
8 2012-01-22 71798.29 883455.7 3
9 2012-01-27 61550.93 889447.7 3
10 2012-02-01 69863.50 838101.4 3
11 2012-02-06 71202.38 873705.6 3
12 2012-02-11 60124.56 828661.6 3
13 2012-02-16 65963.74 824347.5 3
14 2012-02-21 79347.69 818929.1 3
15 2012-02-26 68082.87 879863.1 3
16 2012-03-02 68661.00 891477.0 3
17 2012-03-07 71369.69 849595.6 3
18 2012-03-12 73265.85 834035.4 3
19 2012-03-17 70777.06 833344.5 3
20 2012-03-22 72104.04 881329.5 3
21 2012-03-27 75471.59 848650.2 3
22 2012-04-01 77590.13 867834.6 3
23 2012-04-06 75664.27 828857.6 3
24 2012-04-11 65789.62 814059.0 3
25 2012-04-16 72841.91 893683.3 3
26 2012-04-21 61047.06 805820.7 3
27 2012-04-26 77232.51 896022.5 3
28 2012-05-01 77553.05 817557.6 3
29 2012-05-06 75597.76 899616.4 3
Perhaps a more efficient way would be to use a join:
df$year = lubridate::year(df$date)
dplyr::left_join(df2, df, by=c("ID", "year")) %>% na.omit()
Upvotes: 2