Nix
Nix

Reputation: 149

Filter using dplyr based on condition as a loop

I have two data frames. df1

    col1
 1  apples
 2 oranges
 3  apples
 4  banana

and df2

   setID    col1
1      1  apples
2      1 oranges
3      1  apples
4      1  banana
5      2  apples
6      2 oranges
7      2 oranges
8      2  apples
9      3 oranges
10     3  grapes
11     3  banana
12     3  banana
13     4  apples
21     4  oranges
31     4  apples
41     4 oranges

I used filter from dplyr package to narrow down df2 by df1$col1[1] and putting the result in tempdf

> tempdf <- df2 %>% group_by(setID) %>% filter(any(col1==df1$col1[1]))
> tempdf
    # A tibble: 12 x 2
# Groups:   setID [3]
   setID    col1
   <dbl>   <chr>
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana
 5     2  apples
 6     2 banana
 7     2 oranges
 8     2  apples
 9     4  apples
10     4  oranges
11     4  grapes
12     4 oranges

It needs to be grouped by setID because I’m trying to see which setID matches to df1 the best. Since setID = 3 doesn’t have any apples it is omitted in the first iteration. I want this to be a loop so that everytime its run the tempdf gets filtered more.

For the next element it should df1$col1[1:2] so that it checks “apples” and “oranges” in order. So after the second iteration tempdf should look like this.

setID    col1
   <dbl>   <chr>
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana
 5     4  apples
 6     4  oranges
 7     4  grapes
 8     4 oranges

The next iteration should check df1$col1[2:3] so it always check the previous element. “oranges” and “apples” this time.

 setID    col1
   <dbl>   <chr>
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana

Once it reaches just one unique setID the loop should end. And the result will be that df1 matches with setID=1 of df2.

Upvotes: 1

Views: 1482

Answers (2)

MKR
MKR

Reputation: 20095

A solution could be based on selecting setIDhaving maximum match on col1 as:

df2 %>% group_by(setID) %>% mutate( maxMatch = sum(col1==df1$col1)) %>% 
  ungroup() %>%
   filter(maxMatch == max(maxMatch)) %>%
  select(-maxMatch)

#Result
# A tibble: 4 x 2
  setID col1   
  <int> <chr>  
1     1 apples 
2     1 oranges
3     1 apples 
4     1 banana 

Upvotes: 1

www
www

Reputation: 39184

Here is one idea to filter the setID. In this case it will return the setID as 1 directly without a loop. The downside is it will not return any results if there are no exact match of all the items in col1. However, it should be much faster than the for-loop approach.

library(dplyr)

df3 <- df1 %>% summarise(col1 = toString(col1))

df4 <- df2 %>%
  group_by(setID) %>%
  summarise(col1 = toString(col1)) %>%
  semi_join(df3, by = "col1")
df4
# # A tibble: 1 x 2
#   setID col1                           
#   <int> <chr>                          
# 1     1 apples, oranges, apples, banana

DATA

df1 <- read.table(text = "    col1
 1  apples
                  2 oranges
                  3  apples
                  4  banana",
                  header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "   setID    col1
1      1  apples
                  2      1 oranges
                  3      1  apples
                  4      1  banana
                  5      2  apples
                  6      2 oranges
                  7      2 oranges
                  8      2  apples
                  9      3 oranges
                  10     3  grapes
                  11     3  banana
                  12     3  banana
                  13     4  apples
                  21     4  oranges
                  31     4  apples
                  41     4 oranges",
                  header = TRUE, stringsAsFactors = FALSE)

Upvotes: 1

Related Questions