Nix
Nix

Reputation: 149

Match observations in two dataframes in R

I have two dataframes. I want to use elements from one dataframe to search through a column from the other dataframe. And I need to narrow down this dataframe by the matches. And then continue narrowing down element by element. Look to the sample code, which can explain better.

df1    col1   

1      apples      
2      oranges     
3      apples    
4      banana  
5      grapes
6      mangoes
7      oranges
8      banana

df1 has only one column in it. Meanwhile df2 has 2 columns in it. setID & col1

df2 setID   col1

1   1   apples      
2   1   oranges     
3   1   oranges
4   1   mangoes
5   1   grapes
6   1   banana  
7   1   banana
8   1   apples    
10  2   apples      
11  2   oranges     
12  2   apples    
13  2   banana  
14  2   grapes
15  2   mangoes
16  2   banana
17  2   oranges
18  3   apples      
19  3   banana  
20  3   oranges     
21  3   apples    
22  3   grapes
23  3   mangoes
24  3   oranges
25  3   banana
26  4   apples      
27  4   oranges     
28  4   apples    
29  4   grapes
30  4   grapes
31  4   oranges     
32  4   banana  
33  4   banana

As you can see there are some repeating setIDs. They mark one set. The order of the set is important. Please note that the df1$col1 does not have to be the same length as a set from df2. Nor do they have to be an exact match. They just have to be a close enough match. In this case df1$col1 is closest a match to df2$setID = 2 with only the last two elements out of order. The reason why they dont have to be an exact match is because I want to use a "search as you type" approach. I do not want to match df1$col1 as it is to a setID on df2. I want to narrow down the possible set by going through element by element. Assume that you get the elements of df1 one by one and not as a complete dataframe. For example:

Find a match for df1$col1[1] from df2 and save any sets that contains the match to a tempdf. It doesnt matter if a match for df1$col1[1] is found more than once in the same set. If it is found at least once then that set will be added to tempdf.

What needs to be retrieved at the end is a setID that corresponds to the set that matches as close to df1. In this case the tempdf will be the same as df2 as all the sets include "apples". Next will be what matches df1$col1[2] against the tempdf given that the first element is a match. I guess df1$col1[1:2] from tempdf. This results in:

tempdf  setID   col1

1   1   apples      
2   1   oranges     
3   1   oranges
4   1   mangoes
5   1   grapes
6   1   banana  
7   1   banana
8   1   apples    
10  2   apples      
11  2   oranges     
12  2   apples    
13  2   banana  
14  2   grapes
15  2   mangoes
16  2   banana
17  2   oranges
26  4   apples      
27  4   oranges     
28  4   apples    
29  4   grapes
30  4   grapes
31  4   oranges     
32  4   banana  
33  4   banana

Basically setID = 3 is omitted. As this continues with the 3rd element from df1 the new tempdf will contain only setID 2 & 4. The loop (my thinking to solve this) would end once only one setID remains, in this case setID = 2. Therefore setID = 2 would be considered as a close match for df1.

Of course feel free to advice on a better approach than this one.

Upvotes: 0

Views: 1036

Answers (3)

Uwe
Uwe

Reputation: 42592

The OP has requested to find setID groups in df2 where the values in col1 are exactly the same as in df2.

For the sake of completeness, here is also a data.table approach:

library(data.table)
tmp <- setDT(df2)[, all(col1 == df1$col1), by = setID][(V1)]
tmp
   setID   V1
1:     1 TRUE

Now, the OP has requested to return the matching rows. This can be accomplished by either looking for matching values of setID

df2[setID %in% tmp$setID]
   setID    col1
1:     1  apples
2:     1 oranges
3:     1  apples
4:     1  banana

or by joining (which presumably might be faster on large tables)

df2[tmp, on = "setID", .SD]

returning the same result.

Caveat

The sample datasets provided by the OP suggests that the number of rows in df1 is the same as in each setID group in df2. The OP has not specified the expected result in case the number of rows differ.

Upvotes: 0

Onyambu
Onyambu

Reputation: 79338

using base R:

split(df2,df2[,1])[by(df2[2],df2[1],function(x)all(x==df1))]
 $`1`
   setID    col1
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

You might want to look at the "compare" package, which would allow you to compare allowing for different transformations.

Here are a couple of examples to consider....

Starting sample data. Note setID == 4, which has all the values, but in the wrong order.

df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)
df1
##      col1
## 1  apples
## 2 oranges
## 3  apples
## 4  banana

df2 <- structure(list(setID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 
    4, 4, 4, 4), col1 = c("apples", "oranges", "apples", "banana", 
    "apples", "grapes", "oranges", "apples", "oranges", "grapes", 
    "banana", "banana", "apples", "apples", "banana", "oranges")), 
    .Names = c("setID", "col1"), 
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8", 
    "9", "10", "11", "12", "13", "21", "31", "41"), class = "data.frame")
df2
##    setID    col1
## 1      1  apples
## 2      1 oranges
## 3      1  apples
## 4      1  banana
## 5      2  apples
## 6      2  grapes
## 7      2 oranges
## 8      2  apples
## 9      3 oranges
## 10     3  grapes
## 11     3  banana
## 12     3  banana
## 13     4  apples
## 21     4  apples
## 31     4  banana
## 41     4 oranges

Load "compare" and do some comparisons:

library(compare)
lapply(split(df2[, "col1", drop = FALSE], df2$setID), 
       function(x) compare(df1, x))
## $`1`
## TRUE
## 
## $`2`
## FALSE [FALSE]
## 
## $`3`
## FALSE [FALSE]
## 
## $`4`
## FALSE [FALSE]
## 

Allow all transformations before comparison (see ?compare for details if you want to allow only for certain transformations).

lapply(split(df2[, "col1", drop = FALSE], df2$setID), 
       function(x) compare(df1, x, allowAll = TRUE))
## $`1`
## TRUE
## 
## $`2`
## FALSE [FALSE]
##   sorted
##   [col1] ignored case
##   renamed rows
##   [col1] ignored case
##   dropped row names
##   [col1] ignored case
## 
## $`3`
## FALSE [FALSE]
##   sorted
##   [col1] ignored case
##   renamed rows
##   [col1] ignored case
##   dropped row names
##   [col1] ignored case
## 
## $`4`
## TRUE
##   sorted
##   renamed rows
##   dropped row names
## 

Upvotes: 1

Related Questions