Reputation: 149
I have two data frames. df1
col1
1 apples
2 oranges
3 apples
4 banana
and df2
setID col1
1 1 apples
2 1 oranges
3 1 apples
4 1 banana
5 2 apples
6 2 oranges
7 2 oranges
8 2 apples
9 3 oranges
10 3 grapes
11 3 banana
12 3 banana
13 4 apples
21 4 oranges
31 4 apples
41 4 oranges
I used filter from dplyr package to narrow down df2 by df1$col1[1] and putting the result in tempdf
> tempdf <- df2 %>% group_by(setID) %>% filter(any(col1==df1$col1[1]))
> tempdf
# A tibble: 12 x 2
# Groups: setID [3]
setID col1
<dbl> <chr>
1 1 apples
2 1 oranges
3 1 apples
4 1 banana
5 2 apples
6 2 banana
7 2 oranges
8 2 apples
9 4 apples
10 4 oranges
11 4 grapes
12 4 oranges
It needs to be grouped by setID because I’m trying to see which setID matches to df1 the best. Since setID = 3 doesn’t have any apples it is omitted in the first iteration. I want this to be a loop so that everytime its run the tempdf gets filtered more.
For the next element it should df1$col1[1:2] so that it checks “apples” and “oranges” in order. So after the second iteration tempdf should look like this.
setID col1
<dbl> <chr>
1 1 apples
2 1 oranges
3 1 apples
4 1 banana
5 4 apples
6 4 oranges
7 4 grapes
8 4 oranges
The next iteration should check df1$col1[2:3] so it always check the previous element. “oranges” and “apples” this time.
setID col1
<dbl> <chr>
1 1 apples
2 1 oranges
3 1 apples
4 1 banana
Once it reaches just one unique setID the loop should end. And the result will be that df1 matches with setID=1 of df2.
Upvotes: 1
Views: 1482
Reputation: 20095
A solution could be based on selecting setID
having maximum match on col1 as:
df2 %>% group_by(setID) %>% mutate( maxMatch = sum(col1==df1$col1)) %>%
ungroup() %>%
filter(maxMatch == max(maxMatch)) %>%
select(-maxMatch)
#Result
# A tibble: 4 x 2
setID col1
<int> <chr>
1 1 apples
2 1 oranges
3 1 apples
4 1 banana
Upvotes: 1
Reputation: 39184
Here is one idea to filter the setID
. In this case it will return the setID
as 1 directly without a loop. The downside is it will not return any results if there are no exact match of all the items in col1
. However, it should be much faster than the for-loop approach.
library(dplyr)
df3 <- df1 %>% summarise(col1 = toString(col1))
df4 <- df2 %>%
group_by(setID) %>%
summarise(col1 = toString(col1)) %>%
semi_join(df3, by = "col1")
df4
# # A tibble: 1 x 2
# setID col1
# <int> <chr>
# 1 1 apples, oranges, apples, banana
DATA
df1 <- read.table(text = " col1
1 apples
2 oranges
3 apples
4 banana",
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = " setID col1
1 1 apples
2 1 oranges
3 1 apples
4 1 banana
5 2 apples
6 2 oranges
7 2 oranges
8 2 apples
9 3 oranges
10 3 grapes
11 3 banana
12 3 banana
13 4 apples
21 4 oranges
31 4 apples
41 4 oranges",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 1