Filter using dplyr based on condition as a loop

Question

I have two data frames. df1

    col1
 1  apples
 2 oranges
 3  apples
 4  banana

and df2

   setID    col1
1      1  apples
2      1 oranges
3      1  apples
4      1  banana
5      2  apples
6      2 oranges
7      2 oranges
8      2  apples
9      3 oranges
10     3  grapes
11     3  banana
12     3  banana
13     4  apples
21     4  oranges
31     4  apples
41     4 oranges

I used filter from dplyr package to narrow down df2 by df1$col1[1] and putting the result in tempdf

> tempdf <- df2 %>% group_by(setID) %>% filter(any(col1==df1$col1[1]))
> tempdf
    # A tibble: 12 x 2
# Groups:   setID [3]
   setID    col1
      
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana
 5     2  apples
 6     2 banana
 7     2 oranges
 8     2  apples
 9     4  apples
10     4  oranges
11     4  grapes
12     4 oranges

It needs to be grouped by setID because I’m trying to see which setID matches to df1 the best. Since setID = 3 doesn’t have any apples it is omitted in the first iteration. I want this to be a loop so that everytime its run the tempdf gets filtered more.

For the next element it should df1$col1[1:2] so that it checks “apples” and “oranges” in order. So after the second iteration tempdf should look like this.

setID    col1
      
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana
 5     4  apples
 6     4  oranges
 7     4  grapes
 8     4 oranges

The next iteration should check df1$col1[2:3] so it always check the previous element. “oranges” and “apples” this time.

 setID    col1
      
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana

Once it reaches just one unique setID the loop should end. And the result will be that df1 matches with setID=1 of df2.

MKR · Accepted Answer

A solution could be based on selecting setIDhaving maximum match on col1 as:

df2 %>% group_by(setID) %>% mutate( maxMatch = sum(col1==df1$col1)) %>% 
  ungroup() %>%
   filter(maxMatch == max(maxMatch)) %>%
  select(-maxMatch)

#Result
# A tibble: 4 x 2
  setID col1   
     
1     1 apples 
2     1 oranges
3     1 apples 
4     1 banana

Filter using dplyr based on condition as a loop

Answers (2)

Related Questions