Reputation: 300
I have to classify a list of products like these:
product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow','chicken breast','noodles','salad','chicken salad with egg'))
Based on the words included in each element of this vector:
product_to_match<-c('cow meat','deer meat','cow milk','chicken breast','chicken egg salad','anana')
A product should be assigned a class only when all the words of an element of product_to_match appear in that element of the data frame.
I am not sure what the best way to do this is. I want to classify each product into a new column, so that I end up with something like this:
product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow','chicken breast','noodles','salad','chicken salad with egg'),
class=c(NA,'cow meat','chicken breast',NA,NA,'chicken egg salad'))
Notice that 'anana' did not match 'banana': even though its characters are contained in the string, it does not appear there as a whole word. I am not sure how to do this.
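To illustrate the distinction, here is a minimal sketch in base R. The variable names are only for illustration and are not part of the data above.

```r
product <- 'banana from ecuador 1 unit'

# Substring search: 'anana' is found inside 'banana'
substring_match <- grepl('anana', product, fixed = TRUE)

# Word-level search: split on whitespace, then compare whole words
words <- strsplit(product, "\\s+")[[1]]
word_match <- 'anana' %in% words
```

Here substring_match is TRUE while word_match is FALSE, and the second behaviour is the one I am after.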
Thank you.
Upvotes: 2
Views: 74
Reputation: 886948
Using stringdist_left_join from the fuzzyjoin package
could get some matches:
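library(fuzzyjoin)
library(tibble)  # provides tibble()
# Fuzzy left join on the product column using soundex codes
stringdist_left_join(product_list, tibble(product = product_to_match),
                     by = 'product', method = 'soundex')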
Upvotes: 0
Reputation: 101099
Perhaps this could help
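# q[i, j] is TRUE when every word of product_to_match[i] occurs in product_list$product[j]
q <- outer(
  strsplit(product_to_match, "\\s+"),
  strsplit(product_list$product, "\\s+"),
  FUN = Vectorize(function(x, y) all(x %in% y))
)
# Row index of the matching class for each product, NA when nothing matches
# (assumes at most one class matches a product; otherwise the indices get summed)
product_list$class <- product_to_match[replace(colSums(q * row(q)), colSums(q) == 0, NA)]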
such that
> product_list
                      product             class
1  banana from ecuador 1 unit              <NA>
2 argentinian meat (1 kg) cow          cow meat
3              chicken breast    chicken breast
4                     noodles              <NA>
5                       salad              <NA>
6      chicken salad with egg chicken egg salad
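If a product could match more than one class, or if punctuation and letter case get in the way (for example a word written as "(cow)"), a variant of the same word-set idea may be more robust. The following is only a sketch under those assumptions: tokenize and classify_products are hypothetical helper names, and picking the first matching class is an arbitrary choice.

```r
# Hypothetical helper: lower-case, replace punctuation with spaces, split into words
tokenize <- function(x) {
  words <- strsplit(tolower(gsub("[[:punct:]]+", " ", x)), "\\s+")
  lapply(words, function(w) w[nzchar(w)])
}

# Hypothetical helper: return the first class whose words all occur in the product
classify_products <- function(products, classes) {
  product_words <- tokenize(products)
  class_words <- tokenize(classes)
  vapply(product_words, function(pw) {
    hits <- vapply(class_words,
                   function(cw) length(cw) > 0 && all(cw %in% pw),
                   logical(1))
    if (any(hits)) classes[which(hits)[1]] else NA_character_
  }, character(1))
}

products <- c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow',
              'chicken breast', 'noodles', 'salad', 'chicken salad with egg')
classes <- c('cow meat', 'deer meat', 'cow milk', 'chicken breast',
             'chicken egg salad', 'anana')

result <- classify_products(products, classes)
```

On the example data this should give the same classes as above. Assigning it with product_list$class <- classify_products(product_list$product, product_to_match) would add the column to the data frame.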
Upvotes: 4