John Huang
John Huang

Reputation: 845

Filtering data frame by multiple columns from another data frame

I have two data frames, and I am trying to filter out one of the data frames using the two columns from the other one. In my example, I am trying to filter out the data by both the year and ID column in df using the df2 data frame. For example:

I would like to filter out all the dates that have ID 1 in 2010, ID 3 in 2012, ID 2 in 2010, and ID 4 in 2013 from the data frame df. So that the output only has the dates from 2010 for ID 1, only 2012 for ID 3, 2010 for ID 2, and 2013 for ID 4 together in a new data frame.


library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)

df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000),
                 ID)

df2 <- data.frame(year = c(2010, 2012, 2010,2013),
                  ID = c(1,3,2,4))

Upvotes: 0

Views: 1753

Answers (2)

AndrewGB
AndrewGB

Reputation: 16836

An alternative using base R is to use merge.

library(lubridate)
df$year = lubridate::year(df$date)

merge(df, df2)

Since, df2 only has the two columns, then the merge is limited to using those columns. If we were to write out the columns that we want to merge by, then it would look something like this.

merge(df,df2[c("year","ID")],by=c("year","ID"))

Upvotes: 2

user438383
user438383

Reputation: 6206

Is this what you need?

Perhaps not the most elegant solution, but you can paste together the combinations of years and ID in both data.frames and then use one to filter the other. Probably not the best way if you have a large data.frame though.

df %>% 
    dplyr::filter(paste0(lubridate::year(date), "_", ID) %in% paste0(df2$year,"_", df2$ID))

         date        x        y ID
1  2010-12-26 74119.46 839347.8  1
2  2010-12-27 72401.02 891788.1  2
3  2010-12-31 66940.94 810089.6  1
4  2012-01-02 68214.97 881200.1  3
5  2012-01-07 70595.92 863277.7  3
6  2012-01-12 79799.85 857738.5  3
7  2012-01-17 61102.50 848880.6  3
8  2012-01-22 71798.29 883455.7  3
9  2012-01-27 61550.93 889447.7  3
10 2012-02-01 69863.50 838101.4  3
11 2012-02-06 71202.38 873705.6  3
12 2012-02-11 60124.56 828661.6  3
13 2012-02-16 65963.74 824347.5  3
14 2012-02-21 79347.69 818929.1  3
15 2012-02-26 68082.87 879863.1  3
16 2012-03-02 68661.00 891477.0  3
17 2012-03-07 71369.69 849595.6  3
18 2012-03-12 73265.85 834035.4  3
19 2012-03-17 70777.06 833344.5  3
20 2012-03-22 72104.04 881329.5  3
21 2012-03-27 75471.59 848650.2  3
22 2012-04-01 77590.13 867834.6  3
23 2012-04-06 75664.27 828857.6  3
24 2012-04-11 65789.62 814059.0  3
25 2012-04-16 72841.91 893683.3  3
26 2012-04-21 61047.06 805820.7  3
27 2012-04-26 77232.51 896022.5  3
28 2012-05-01 77553.05 817557.6  3
29 2012-05-06 75597.76 899616.4  3

Perhaps a more efficient way would be to use a join:

df$year = lubridate::year(df$date)
dplyr::left_join(df2, df, by=c("ID", "year")) %>% na.omit()

Upvotes: 2

Related Questions