Fraxxx

Reputation: 114

Condition-based string matching using grepl and ifelse

I have a data frame df, shown below.

a <- c(1:6)
b <- c("Audi,BMW,Skoda, Rackets,Toy,Football",
       "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby",
       "Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet",
       "Lemon,Yamaha,Table,Kawasaki,Chair,Fruits", 
       "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai",
       "Honey,Apple,Alcohol,cake,Sweets, Mango")
df <- data.frame(a, b)


I also have two vectors containing brand names of cars and motorbikes.

cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen","Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia","KTM", "Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")

I used grepl with ifelse to match the words from the two vectors against df$b and assign a value to each row that has a match.

df$c<-ifelse(grepl(paste(cars, collapse="|"), df$b), "cars",
      ifelse(grepl(paste(motorbike, collapse="|"),df$b), "bikes","others"))

Now I want to add a condition: a value ("cars" or "bikes") should be assigned in df$c only if 4 or more words match in that row; otherwise the row should be "others". I want my df to look like this:

structure(list(a = 1:6, b = structure(c(1L, 6L, 5L, 4L, 2L, 3L
), .Label = c("Audi,BMW,Skoda, Rackets,Toy,Football", "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai", 
"Honey,Apple,Alcohol,cake,Sweets, Mango", "Lemon,Yamaha,Table,Kawasaki,Chair,Fruits", 
"Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet", "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby"
), class = "factor"), c = c("others", "bikes", "cars", "others", 
"cars", "others")), row.names = c(NA, 6L), class = "data.frame") 

Upvotes: 0

Views: 84

Answers (1)

Lennyy

Reputation: 6132

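Does this help? You can of course delete the amountcars and amountmotors columns afterwards. Do you expect that a string will never contain both more than 3 cars and more than 3 motorbikes? Based on the comments, I have updated my answer: a row is now labelled only when one count is above 3 and strictly greater than the other.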

library(stringr)
df$amountcars <- str_count(df$b, paste(cars, collapse="|"))
df$amountmotors <- str_count(df$b, paste(motorbike, collapse="|"))

# Label a row only when one category has at least 4 matches and more than the other.
df$c <- ifelse(df$amountcars > 3 & df$amountcars > df$amountmotors, "cars",
        ifelse(df$amountmotors > 3 & df$amountmotors > df$amountcars, "bikes", "others"))
df

  a                                              b amountcars amountmotors      c
1 1           Audi,BMW,Skoda, Rackets,Toy,Football          3            0 others
2 2 Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby          0            4  bikes
3 3  Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet          4            0   cars
4 4       Lemon,Yamaha,Table,Kawasaki,Chair,Fruits          0            2 others
5 5     Ford, chevrolet,Bread,Ducati,Tesla,Hyundai          4            1   cars
6 6         Honey,Apple,Alcohol,cake,Sweets, Mango          0            0 others
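
One caveat: str_count with an alternation pattern counts substring matches, so a brand name that appears inside a longer word would be counted too. If only exact, comma-separated entries should count, something like the following base R sketch might work. The helper name count_exact is my own invention for illustration, and it assumes the entries are always separated by commas, as in the sample data.

```r
# Data from the question.
df <- data.frame(
  a = 1:6,
  b = c("Audi,BMW,Skoda, Rackets,Toy,Football",
        "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby",
        "Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet",
        "Lemon,Yamaha,Table,Kawasaki,Chair,Fruits",
        "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai",
        "Honey,Apple,Alcohol,cake,Sweets, Mango"),
  stringsAsFactors = FALSE
)
cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes",
          "Volkswagen","Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati",
               "Aprilia","KTM","Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")

# Hypothetical helper: split each string on commas, trim whitespace,
# and count the entries that are exactly equal to one of the brand names.
count_exact <- function(strings, brands) {
  tokens <- strsplit(as.character(strings), ",", fixed = TRUE)
  vapply(tokens, function(tok) sum(trimws(tok) %in% brands), integer(1))
}

n_cars  <- count_exact(df$b, cars)
n_bikes <- count_exact(df$b, motorbike)

# Same rule as above: at least 4 matches and strictly more than the other category.
df$c <- ifelse(n_cars > 3 & n_cars > n_bikes, "cars",
        ifelse(n_bikes > 3 & n_bikes > n_cars, "bikes", "others"))
df
```

This needs no extra packages, but it is case-sensitive and only as good as the comma-splitting assumption.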

Based on the comments, if you have more categories (say 9 vectors of strings), first create all the vectors:

cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen","Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia","KTM", "Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")

Then put these in a list and add the names:

list1 <- list(cars, motorbike)
names(list1) <- c("cars", "motorbike")

Finally, run this code:

# Count the matches per row for every vector in the list (one column per vector).
counts <- sapply(list1, function(x) str_count(df$b, paste0(x, collapse = "|")))
df$d <- ifelse(apply(counts, 1, max) > 3,
               names(list1)[apply(counts, 1, which.max)],
               "others")

Basically, for each row it takes the highest match count across the vectors. If that count is above 3, it assigns the name of the corresponding vector (here "cars" or "motorbike"); otherwise it assigns "others".
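
Note that which.max returns the first maximum, so with the list approach a row with, say, 4 cars and 4 motorbikes would be labelled "cars", whereas the two-column version above gives "others" for such a tie. If ties should fall back to "others" here as well, a sketch along these lines might do. The function label_rows and its min_matches argument are names I made up for illustration; it uses base R only, counting matches with gregexpr and regmatches instead of str_count.

```r
# Hypothetical helper (not part of any package): label each string by the
# category with the most matches, returning "others" when the top count is
# below `min_matches` or is shared by more than one category.
label_rows <- function(strings, brand_list, min_matches = 4) {
  strings <- as.character(strings)
  counts <- sapply(brand_list, function(brands) {
    pattern <- paste(brands, collapse = "|")
    lengths(regmatches(strings, gregexpr(pattern, strings)))
  })
  # sapply drops to a plain vector for a single string, so force a matrix.
  counts <- matrix(counts, nrow = length(strings),
                   dimnames = list(NULL, names(brand_list)))
  top  <- apply(counts, 1, max)
  tied <- rowSums(counts == top) > 1
  ifelse(top >= min_matches & !tied,
         names(brand_list)[max.col(counts, ties.method = "first")],
         "others")
}

# Small made-up example to show the tie handling.
brands <- list(cars = c("Audi", "BMW", "Ford", "Skoda"),
               motorbike = c("Yamaha", "Suzuki", "Kawasaki", "Ducati"))
x <- c("Audi,BMW,Ford,Skoda,Yamaha,Suzuki,Kawasaki,Ducati",  # 4 vs 4, a tie
       "Audi,BMW,Ford,Skoda,Yamaha",                          # 4 vs 1
       "Audi,Yamaha")                                         # 1 vs 1, below threshold
result <- label_rows(x, brands)
result
```

Whether a tie should become "others" or go to the first category is a design choice, so adjust the `tied` condition to taste.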

Upvotes: 2
