user1769069
user1769069

Reputation: 19

Select from Multiple Columns - R

I have two sets of dataframes. One is combinations of strings where there are two columns, with different types of food:

#df.combination

      [,1]     [,2] 
[1,] "Apple" "Orange"         
[2,] "Apple" "Pear"         
[3,] "Apple" "Avocado" 
[4,] "Orange" "Pear"   
[5,] "Orange" "Avocado"
[6,] "Pear" "Avocado"

The other is a big "main" dataframe that has three columns of food ("id" "date" "food1" "food2" "food3") containing some of these combinations:

#df.main

      [,1]     [,2]     [,3]     [,4]      [,5]    
[1,] "1234"   "3/29"    "Sala"    "Pear"   "Avocado"
[2,] "1235"   "3/30"    "Apple"   "Pear"   "Meat"     
[3,] "1236"   "4/1"     "Orange"   "Juice"  "Apple" 
[4,] "1237"   "4/2"     "Pear"    "Avocado""Turkey" 

If I wanted to write a script that searches df.main and selects rows containing all elements from df.combination[1,], (so "Apple" and "Orange"), how would I be able to do that? The foods do not have to be in any order. The row just needs to contain the food. (i.e. df.main[3,]).

Here is an example output I would like to see. If I search for "Orange" and "Apple" (so df.combination[1,]) in df.main, I would like to see the id of row df.main[2,]

#search df.main for row containing df.combination[1,]
#output:
#1236

Thank you! Any help really appreciated.

Upvotes: 0

Views: 2082

Answers (2)

szabad
szabad

Reputation: 380

For this purpose, you can write a function that extends the match function to filter on records that contain all the values of a certain vector, as follows:

match_filter <- function(df, match_to) { 
    apply(df, 
          1, 
          function(row) {
              !any(is.na(match(match_to, row)))
          })
}

So, the function match_filter takes 2 arguments: The first is df, which in this case would be your df.main dataset (or a subset of it - as we shall see). The second is match_to, which is the vector we want to match to, or we want to have all of its values in each record in df.

The function is easy to understand, however, it's useful to look at some of its components. What the match function is doing in this case is for each value in the match_to vector, return its index in the row vector (which is a row in df). If that value isn't found in row, then it returns NA. This is an example of running match on the second row of the dataset you provided:

> match(df.combination[1,], df.main[2,])
[1]  3 NA

Now, in order for the record row to satisfy what we want in this case, then the vector returned by the match function shouldn't have any NA values. That's why we wrapping the match(match_to, row) function with !any(is.na()). So, if there are any NAs in the return value of match, then we want to discard that record.

The reason we are wrapping all of this in an apply function is because we want to run the internal function (function(row){...}) on each row of df and that's what the apply functions are used for.

This is an example of using the function defined above on your dataset:

> df.main[ match_filter(df.main[,3:5], df.combination[1,]),] 
    id date  food1 food2 food3
3 1236  4/1 Orange Juice Apple
> subset(df.main, match_filter(df.main[,3:5], df.combination[1,])) 
    id date  food1 food2 food3
3 1236  4/1 Orange Juice Apple

As you can see, since in this case you're interested in columns 3:5 in the df.main dataframe, we just pass those to the match_filter function.

Update:

If you'd like to use a function that can take any number of combinations, then we need to update match_filter to accommodate that. The update is really simple and follows the same logic we've seen before:

match_filter <- function(df, match_to) {
    apply(df,
          1,
          function(row1) {
              any(apply(match_to,
                    1,
                    function(row2) {
                        !any(is.na(match(row2, row1)))
                    }))
          })
}

Now the match_filter function is a more generic function that can take any number of combinations and check if row1 has any of those combinations. And match_to in this case can be a dataframe, corresponding to df.combination. Here are a few examples using the new function:

First, I added this hypothetical record for illustration purposes:

df.main[5,] <- c("1238",   "4/3",     "Apple",    "Avocado", "Orange" )

Here are the examples of how the function may be used:

# Example showing how the function 
# works with only the first row of df.combination
> df.main[ match_filter(df.main[,3:5], df.combination[1,]),] 
    id date  food1   food2  food3
3 1236  4/1 Orange   Juice  Apple
5 1238  4/3  Apple Avocado Orange
# The first 2 rows of df.combination 
> df.main[ match_filter(df.main[,3:5], df.combination[1:2,]),] 
    id date  food1   food2  food3
2 1235 3/30  Apple    Pear   Meat
3 1236  4/1 Orange   Juice  Apple
5 1238  4/3  Apple Avocado Orange
# All the rows in the df.combination dataframe
> df.main[ match_filter(df.main[,3:5], df.combination),] 
    id date  food1   food2   food3
1 1234 3/29   Sala    Pear Avocado
2 1235 3/30  Apple    Pear    Meat
3 1236  4/1 Orange   Juice   Apple
4 1237  4/2   Pear Avocado  Turkey
5 1238  4/3  Apple Avocado  Orange

Upvotes: 0

akrun
akrun

Reputation: 886938

You could try

 f1 <- function(dat1, dat2, rowindex){
  Indx <- apply(dat1[,grep('food', colnames(dat1))], 1,
         function(x) all(unlist(dat2[rowindex,]) %in% x))
  dat1[Indx,1]
 }
 f1(df.main, df.combination,1)
 #[1] 1236
 f1(df.main, df.combination,2)
 #[1] 1235
 f1(df.main, df.combination,3)
 #integer(0)

data

df.main <- structure(list(id = 1234:1237, date = c("3/29", "3/30",
"4/1", 
"4/2"), food1 = c("Sala", "Apple", "Orange", "Pear"), 
food2 = c("Pear", 
"Pear", "Juice", "Avocado"), food3 = c("Avocado", "Meat", "Apple", 
 "Turkey")), .Names = c("id", "date", "food1", "food2", 
 "food3"), class = "data.frame", row.names = c(NA, -4L))

df.combination <- structure(list(V1 = c("Apple", "Apple", "Apple", 
"Orange", "Orange", 
"Pear"), V2 = c("Orange", "Pear", "Avocado", "Pear", "Avocado", 
"Avocado")), .Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -6L))

Upvotes: 1

Related Questions