Simon

Reputation: 1111

Mutate dataframe and carry out partial string match

Suppose you have a string-heavy dataframe:

     x <- data.frame(name = c("Alice", "Alice", "Alice", "Bob", "Bob", "Charlie"),
                    prod = c("Hard Hat", "Goggles", "Bus Fare", "Goggles", "Training", "Hard Hat, Laptop"))


How can I add a new column (let's call it category) to this dataframe that categorises each row based on some arbitrary criteria? For example, how can I set x$category to "PPE" if 'Hard Hat' or 'Goggles' appears in x$prod, but to "IT" if 'Laptop' appears in x$prod?

In addition, I would like the matching to handle partial matches and different cases, if possible. For example, 'Bus Fare' could also be entered as 'Bus Ticket', 'BUS FARE' or 'Bus TICKET' (a non-exhaustive list); in every case, I'd need to categorise it as 'Transport' because the word 'Bus' is present.

Expected output:

     name     prod  category
1   Alice Hard Hat       PPE
2   Alice  Goggles       PPE
3   Alice Bus Fare TRANSPORT
4     Bob  Goggles       PPE
5     Bob Training  TRAINING
6 Charlie   Laptop        IT

I would ideally like to solve this within the tidyverse. I think it will require a combination of mutate() and various stringr functions, but I can't quite figure out the exact workflow.

Upvotes: 2

Views: 490

Answers (1)

paqmo

Reputation: 3729

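Given your situation, you will probably need to create a vector of keywords for each category, collapse each vector into a single regular expression with |, and pass it to str_detect(). Lowercasing prod first makes the match case-insensitive, and because str_detect() matches anywhere in the string, partial matches such as 'Bus Ticket' are covered too: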

library(dplyr)    # mutate(), case_when(), %>%
library(stringr)  # str_detect()

x <- data.frame(name = c("Alice", "Alice", "Alice", "Bob", "Bob", "Charlie"),
                prod = c("Hard Hat", "Goggles", "Bus Fare", "Goggles", "Training", "Hard Hat, Laptop"))

# Lowercase keywords for each category
transport <- c("bus")
ppe <- c("goggles", "hard hat")
tech <- c("laptop")
training <- c("training")

# case_when() assigns the first category whose pattern matches
x <- x %>% 
  mutate(
    category = 
      case_when(
        str_detect(tolower(prod), paste(transport, collapse = "|")) ~ "TRANSPORT",
        str_detect(tolower(prod), paste(ppe, collapse = "|")) ~ "PPE",
        str_detect(tolower(prod), paste(tech, collapse = "|")) ~ "IT",
        str_detect(tolower(prod), paste(training, collapse = "|")) ~ "TRAINING"
      )
  )

Result:

> x
     name             prod  category
1   Alice         Hard Hat       PPE
2   Alice          Goggles       PPE
3   Alice         Bus Fare TRANSPORT
4     Bob          Goggles       PPE
5     Bob         Training  TRAINING
6 Charlie Hard Hat, Laptop       PPE
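
Note that row 6 comes out as PPE rather than the IT shown in the question's expected output: case_when() returns the value for the first condition that is TRUE, and "Hard Hat, Laptop" matches the PPE keywords before the IT ones are checked. If IT should take priority, one option is to list it first. The following is only a rough sketch under that assumption; the helper kw(), the data frame y, and the "Stapler" row are made up for illustration, and it swaps tolower() for stringr's regex(ignore_case = TRUE):

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not part of any package): collapse a keyword vector
# into one case-insensitive regular expression.
kw <- function(words) regex(paste(words, collapse = "|"), ignore_case = TRUE)

# Illustrative inputs, including the case variants mentioned in the question
y <- data.frame(
  prod = c("Hard Hat, Laptop", "BUS FARE", "Bus TICKET", "Stapler"),
  stringsAsFactors = FALSE
)

y <- y %>%
  mutate(
    category = case_when(
      str_detect(prod, kw(c("laptop")))              ~ "IT",         # checked first, so it wins over PPE
      str_detect(prod, kw(c("goggles", "hard hat"))) ~ "PPE",
      str_detect(prod, kw(c("bus")))                 ~ "TRANSPORT",
      str_detect(prod, kw(c("training")))            ~ "TRAINING",
      TRUE                                           ~ NA_character_ # no keyword matched
    )
  )
```

Because the patterns match anywhere in the string, "bus" would also match words such as "business"; if that is a concern, a keyword like "\\bbus\\b" restricts the match to the whole word.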

Upvotes: 2
