Condition-based string matching using grepl and ifelse

Question

I have a data frame df, shown below.

a <- c(1:6)
b <- c("Audi,BMW,Skoda, Rackets,Toy,Football",
       "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby",
       "Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet",
       "Lemon,Yamaha,Table,Kawasaki,Chair,Fruits", 
       "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai",
       "Honey,Apple,Alcohol,cake,Sweets, Mango")
df <- data.frame(a,b)


I also have two vectors containing brand names of cars and motorbikes.

cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen","Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia","KTM", "Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")

I used grepl with ifelse to match the words from the two vectors against df$b and assign a value to each row that has a match.

df$c<-ifelse(grepl(paste(cars, collapse="|"), df$b), "cars",
      ifelse(grepl(paste(motorbike, collapse="|"),df$b), "bikes","others"))

Now I want to add a condition: a value (cars, bikes) should be assigned in df$c only if 4 or more words match in a row. I want my df to look like this:

structure(list(a = 1:6, b = structure(c(1L, 6L, 5L, 4L, 2L, 3L
), .Label = c("Audi,BMW,Skoda, Rackets,Toy,Football", "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai", 
"Honey,Apple,Alcohol,cake,Sweets, Mango", "Lemon,Yamaha,Table,Kawasaki,Chair,Fruits", 
"Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet", "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby"
), class = "factor"), c = c("others", "bikes", "cars", "others", 
"cars", "others")), row.names = c(NA, 6L), class = "data.frame")

Lennyy · Accepted Answer

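Does this help? You can delete the amountcars and amountmotors columns afterwards. Do you expect that a string will never contain more than 3 cars and more than 3 motorbikes at once? Based on the comments, I have updated my answer: a row is labelled only if one category has more than 3 matches and strictly more matches than the other, so a tie yields "others".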

library(stringr)
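# Count how many brand names from each vector occur in every string
df$amountcars <- str_count(df$b, paste(cars, collapse="|"))
df$amountmotors <- str_count(df$b, paste(motorbike, collapse="|"))

# Label a row only if one category has at least 4 matches and more than the other
df$c <- ifelse(df$amountcars > 3 & df$amountcars > df$amountmotors, "cars",
        ifelse(df$amountmotors > 3 & df$amountmotors > df$amountcars, "bikes", "others"))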
df

  a                                              b amountcars amountmotors      c
1 1           Audi,BMW,Skoda, Rackets,Toy,Football          3            0 others
2 2 Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby          0            4  bikes
3 3  Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet          4            0   cars
4 4       Lemon,Yamaha,Table,Kawasaki,Chair,Fruits          0            2 others
5 5     Ford, chevrolet,Bread,Ducati,Tesla,Hyundai          4            1   cars
6 6         Honey,Apple,Alcohol,cake,Sweets, Mango          0            0 others
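
One caveat with the counts above: str_count counts regular-expression matches, so a brand name that appears inside a longer word would be counted too. Whether that matters depends on your data. A minimal sketch of a stricter count, assuming whole-word matches are wanted, is below. It uses base R (gregexpr and regmatches) instead of stringr, and count_whole_words is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: count whole-word matches of any term in each string.
count_whole_words <- function(x, terms) {
  x <- as.character(x)  # also accepts a factor column
  # \\b anchors the alternation at word boundaries,
  # so "Ford" does not match inside "Fordham"
  pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")
  hits <- gregexpr(pattern, x, perl = TRUE)
  # regmatches returns character(0) when nothing matched, so lengths() gives 0
  lengths(regmatches(x, hits))
}

demo <- c("Fordham,Ford,Tesla", "Honey,Apple")
whole <- count_whole_words(demo, c("Ford", "Tesla"))
whole
```

With the plain alternation, the first demo string would count "Ford" twice (once inside "Fordham"); with the boundaries it is counted once.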

Based on the comments: if you have more categories (say, 9 vectors of strings), first create all the vectors:

cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen","Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia","KTM", "Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")

Then put these in a list and name its elements:

list1 <- list(cars, motorbike)
names(list1) <- c("cars", "motorbike")

Finally, run this code:

# Count matrix: one row per string, one column per category
counts <- sapply(list1, function(x) str_count(df$b, paste0(x, collapse = "|")))
df$d <- ifelse(apply(counts, 1, max) > 3,
               names(list1)[apply(counts, 1, which.max)],
               "others")

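For each row, this computes the largest match count across the vectors. If that count is above 3, the row gets the name of the vector with the most matches (taken from names(list1), so "motorbike" rather than "bikes" here); otherwise it gets "others". Note that which.max returns the first maximum, so on a tie the vector listed first in list1 wins.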
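
If ties should fall back to "others", as in the two-category version earlier in this answer, the classification can be wrapped in a small function. This is only a sketch under that assumption: count_matches and classify_rows are hypothetical helper names, the counting is done in base R rather than with str_count, and groups must be a named list.

```r
# Hypothetical helper: count regex matches of any term in each string (base R).
count_matches <- function(x, terms) {
  x <- as.character(x)
  lengths(regmatches(x, gregexpr(paste(terms, collapse = "|"), x)))
}

# Hypothetical helper: label each string with the winning group name.
# groups must be a named list of character vectors.
classify_rows <- function(x, groups, min_hits = 4) {
  counts <- vapply(groups, function(terms) count_matches(x, terms),
                   integer(length(x)))
  # vapply returns a plain vector for a single string, so force a matrix
  counts <- matrix(counts, nrow = length(x),
                   dimnames = list(NULL, names(groups)))
  best <- apply(counts, 1, max)
  # a row is tied when more than one group reaches the row maximum
  tied <- rowSums(counts == best) > 1
  winner <- names(groups)[max.col(counts, ties.method = "first")]
  ifelse(best >= min_hits & !tied, winner, "others")
}

groups <- list(
  cars = c("Audi", "BMW", "Ford", "Skoda", "chevrolet", "Tesla", "Hyundai"),
  motorbike = c("Suzuki", "Kawasaki", "Ducati", "Aprilia")
)
b <- c("Audi,BMW,Skoda, Rackets,Toy,Football",
       "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby",
       "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai")
labels <- classify_rows(b, groups)
labels
```

The threshold is a parameter (min_hits) so that "4 or more" can be changed without touching the logic, and adding a category only means adding another named element to groups.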
