ralm

Reputation: 21

Multiple pair-wise string comparisons in R

I am trying to check whether the value of one (string) variable matches the value in any of several other (string) variables in an R dataframe. If there is at least one match, I would like to return TRUE; if not, I would like to return FALSE.

Consider this toy dataframe:

toydf<-data.frame(
  base1=c("DOG","CAT","MOUSE"),
  base2=c("FISH","RAT","BUNNY"),
  target=c("DOG","HORSE","BUNNY"),
  stringsAsFactors=FALSE)

    base1 base2 target
  1   DOG  FISH    DOG
  2   CAT   RAT  HORSE
  3 MOUSE BUNNY  BUNNY

I want to compare the values in target with those in both base1 and base2 and return TRUE if there is at least one match, and FALSE otherwise:

    base1 base2 target check
  1   DOG  FISH    DOG  TRUE
  2   CAT   RAT  HORSE FALSE
  3 MOUSE BUNNY  BUNNY  TRUE

In this simple and small example, I know this can be easily achieved using:

toydf$check<-toydf$target==toydf$base1 | toydf$target==toydf$base2

However, in the actual dataset, I have a very large number of base variables against which to check for matches, so I'd like to avoid repeating these | statements.

I've attempted to achieve this using %in% but in order to do that, I first have to collect the values of base1 and base2 in a list or vector:

toydf$baseall<-apply(toydf[1:2],1,function(x) list(x))
toydf$check<-toydf$target %in% toydf$baseall

However, this returns a vector with all values set to FALSE. I suspect this has something to do with the way the list is created in the dataframe, but I am not sure how to solve this.

Any help would be appreciated. Thank you.

Upvotes: 2

Views: 386

Answers (2)

RHertel

Reputation: 23818

Here's another possibility:

toydf$check <- as.logical(rowSums(toydf==toydf$target)-1)
#> toydf
#  base1 base2 target check
#1   DOG  FISH    DOG  TRUE
#2   CAT   RAT  HORSE FALSE
#3 MOUSE BUNNY  BUNNY  TRUE

This code counts, for each row of the dataframe, how many entries equal the value in the column toydf$target. Because the target column itself was not excluded from the comparison, the count is always at least one (the target entry always equals itself), so we subtract 1 to correct for this. The result for each row is then converted to a logical value: FALSE if the corrected count is zero (no entry in the other columns equals the target), and TRUE otherwise.
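With many base columns, the same counting idea could be written so that the target column is left out of the comparison and no correction by 1 is needed. This is only a sketch: it assumes that every column to compare against has a name starting with "base" (true for the toy data, an assumption for the real dataset), and the names base_cols and matches are made up for illustration.

```r
# Toy data from the question
toydf <- data.frame(
  base1 = c("DOG", "CAT", "MOUSE"),
  base2 = c("FISH", "RAT", "BUNNY"),
  target = c("DOG", "HORSE", "BUNNY"),
  stringsAsFactors = FALSE
)

# Assumption: all columns to check have names starting with "base"
base_cols <- grep("^base", names(toydf), value = TRUE)

# Comparing a data frame with a vector of length nrow() recycles the vector
# down each column, so every row is compared with its own target value.
# The result is a logical matrix with one column per base column.
matches <- toydf[base_cols] == toydf$target

# TRUE if at least one base column matches; na.rm = TRUE skips missing values
toydf$check <- rowSums(matches, na.rm = TRUE) > 0
```

Selecting the columns by name rather than by position means the result does not change if further columns (such as check itself) are later added to the dataframe.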

Hope this helps.

Upvotes: 2

Rick

Reputation: 898

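# how about (each base column is compared with target element-wise, row by row;
# %in% would instead match against every value in target, not just the same row):
bool <- toydf[, 1:2] == toydf$target
toydf$check <- apply(bool, 1, any)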
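Another way to avoid typing the | statements by hand, sketched here under the assumption that the base columns can be listed by name, is to build one logical vector per base column and fold them together with Reduce. The names base_cols and per_column are hypothetical.

```r
# Toy data from the question
toydf <- data.frame(
  base1 = c("DOG", "CAT", "MOUSE"),
  base2 = c("FISH", "RAT", "BUNNY"),
  target = c("DOG", "HORSE", "BUNNY"),
  stringsAsFactors = FALSE
)

# Assumption: the columns to check are known by name
base_cols <- c("base1", "base2")

# One logical vector per base column: does it equal target in the same row?
per_column <- lapply(toydf[base_cols], function(col) col == toydf$target)

# Fold the vectors together with |, equivalent to
# target == base1 | target == base2 | ...
toydf$check <- Reduce(`|`, per_column)
```

This gives the same result as the chain of | comparisons in the question, but extending it to more columns only requires a longer base_cols vector.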

Upvotes: 0
