Reputation: 19
I have two sets of dataframes. One is combinations of strings where there are two columns, with different types of food:
#df.combination
[,1] [,2]
[1,] "Apple" "Orange"
[2,] "Apple" "Pear"
[3,] "Apple" "Avocado"
[4,] "Orange" "Pear"
[5,] "Orange" "Avocado"
[6,] "Pear" "Avocado"
The other is a big "main" dataframe that has three columns of food ("id" "date" "food1" "food2" "food3") containing some of these combinations:
#df.main
[,1] [,2] [,3] [,4] [,5]
[1,] "1234" "3/29" "Sala" "Pear" "Avocado"
[2,] "1235" "3/30" "Apple" "Pear" "Meat"
[3,] "1236" "4/1" "Orange" "Juice" "Apple"
[4,] "1237" "4/2" "Pear" "Avocado""Turkey"
If I wanted to write a script that searches df.main and selects rows containing all elements from df.combination[1,], (so "Apple" and "Orange"), how would I be able to do that? The foods do not have to be in any order. The row just needs to contain the food. (i.e. df.main[3,]).
Here is an example output I would like to see. If I search for "Orange" and "Apple" (so df.combination[1,]) in df.main, I would like to see the id of row df.main[2,]
#search df.main for row containing df.combination[1,]
#output:
#1236
Thank you! Any help really appreciated.
Upvotes: 0
Views: 2082
Reputation: 380
For this purpose, you can write a function that extends the match function to filter on records that contain all the values of a certain vector, as follows:
match_filter <- function(df, match_to) {
apply(df,
1,
function(row) {
!any(is.na(match(match_to, row)))
})
}
So, the function match_filter
takes 2 arguments: The first is df
, which in this case would be your df.main
dataset (or a subset of it - as we shall see). The second is match_to
, which is the vector we want to match to, or we want to have all of its values in each record in df
.
The function is easy to understand, however, it's useful to look at some of its components. What the match
function is doing in this case is for each value in the match_to
vector, return its index in the row
vector (which is a row in df
). If that value isn't found in row
, then it returns NA
. This is an example of running match on the second row of the dataset you provided:
> match(df.combination[1,], df.main[2,])
[1] 3 NA
Now, in order for the record row
to satisfy what we want in this case, then the vector returned by the match
function shouldn't have any NA
values. That's why we wrapping the match(match_to, row)
function with !any(is.na())
. So, if there are any NA
s in the return value of match
, then we want to discard that record.
The reason we are wrapping all of this in an apply
function is because we want to run the internal function (function(row){...}
) on each row of df
and that's what the apply
functions are used for.
This is an example of using the function defined above on your dataset:
> df.main[ match_filter(df.main[,3:5], df.combination[1,]),]
id date food1 food2 food3
3 1236 4/1 Orange Juice Apple
> subset(df.main, match_filter(df.main[,3:5], df.combination[1,]))
id date food1 food2 food3
3 1236 4/1 Orange Juice Apple
As you can see, since in this case you're interested in columns 3:5
in the df.main
dataframe, we just pass those to the match_filter
function.
Update:
If you'd like to use a function that can take any number of combinations, then we need to update match_filter
to accommodate that. The update is really simple and follows the same logic we've seen before:
match_filter <- function(df, match_to) {
apply(df,
1,
function(row1) {
any(apply(match_to,
1,
function(row2) {
!any(is.na(match(row2, row1)))
}))
})
}
Now the match_filter
function is a more generic function that can take any number of combinations and check if row1
has any of those combinations. And match_to
in this case can be a dataframe, corresponding to df.combination
. Here are a few examples using the new function:
First, I added this hypothetical record for illustration purposes:
df.main[5,] <- c("1238", "4/3", "Apple", "Avocado", "Orange" )
Here are the examples of how the function may be used:
# Example showing how the function
# works with only the first row of df.combination
> df.main[ match_filter(df.main[,3:5], df.combination[1,]),]
id date food1 food2 food3
3 1236 4/1 Orange Juice Apple
5 1238 4/3 Apple Avocado Orange
# The first 2 rows of df.combination
> df.main[ match_filter(df.main[,3:5], df.combination[1:2,]),]
id date food1 food2 food3
2 1235 3/30 Apple Pear Meat
3 1236 4/1 Orange Juice Apple
5 1238 4/3 Apple Avocado Orange
# All the rows in the df.combination dataframe
> df.main[ match_filter(df.main[,3:5], df.combination),]
id date food1 food2 food3
1 1234 3/29 Sala Pear Avocado
2 1235 3/30 Apple Pear Meat
3 1236 4/1 Orange Juice Apple
4 1237 4/2 Pear Avocado Turkey
5 1238 4/3 Apple Avocado Orange
Upvotes: 0
Reputation: 886938
You could try
f1 <- function(dat1, dat2, rowindex){
Indx <- apply(dat1[,grep('food', colnames(dat1))], 1,
function(x) all(unlist(dat2[rowindex,]) %in% x))
dat1[Indx,1]
}
f1(df.main, df.combination,1)
#[1] 1236
f1(df.main, df.combination,2)
#[1] 1235
f1(df.main, df.combination,3)
#integer(0)
df.main <- structure(list(id = 1234:1237, date = c("3/29", "3/30",
"4/1",
"4/2"), food1 = c("Sala", "Apple", "Orange", "Pear"),
food2 = c("Pear",
"Pear", "Juice", "Avocado"), food3 = c("Avocado", "Meat", "Apple",
"Turkey")), .Names = c("id", "date", "food1", "food2",
"food3"), class = "data.frame", row.names = c(NA, -4L))
df.combination <- structure(list(V1 = c("Apple", "Apple", "Apple",
"Orange", "Orange",
"Pear"), V2 = c("Orange", "Pear", "Avocado", "Pear", "Avocado",
"Avocado")), .Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -6L))
Upvotes: 1