Shannon
Shannon

Reputation: 135

Create a column based on multiple patterns

I have a column with 692 degree titles that I need to categorize as: certificate, associate, bachelor, master or higher. There is a lot of inconsistency in a degree titles. For example, a BS degree might include BS, B S, BS in Nursing, BSE, B.S. Accounting, Bachelor of Science, Bachelor of Science in Genetics, and so on. Each one need to be categorized as "Bachelor".

I've tried to detect as much of the string as possible using str_detect, but have not been very successful. How would I detect these different types of degrees?

What I have What I need
Bachelor of Science Bachelor
BA Bachelor
BFA Bachelor
Certificate in Nursing Certificate
Associates in Art Associate
AA Associate
MS Master or higher
Masters of Art Master or higher

Upvotes: 1

Views: 50

Answers (1)

jpdugo17
jpdugo17

Reputation: 7116

Maybe something like this?

library(tidyverse)

df <- 
tibble::tribble(
            ~What.I.have,       ~What.I.need,
   "Bachelor of Science",         "Bachelor",
                    "BA",         "Bachelor",
                   "BFA",         "Bachelor",
"Certificate in Nursing",      "Certificate",
     "Associates in Art",        "Associate",
                    "AA",        "Associate",
                    "MS", "Master or higher",
        "Masters of Art", "Master or higher"
)

df %>% mutate(new = case_when(str_detect(What.I.have, 'Bachelor|BA|BFA') ~ 'Bachelor',
                              str_detect(What.I.have, 'Certificate') ~ 'Certificate',
                              str_detect(What.I.have, 'Associates|AA') ~ 'Associate',
                              str_detect(What.I.have, 'Masters|MS') ~ 'Master or higher'))
#> # A tibble: 8 × 3
#>   What.I.have            What.I.need      new             
#>   <chr>                  <chr>            <chr>           
#> 1 Bachelor of Science    Bachelor         Bachelor        
#> 2 BA                     Bachelor         Bachelor        
#> 3 BFA                    Bachelor         Bachelor        
#> 4 Certificate in Nursing Certificate      Certificate     
#> 5 Associates in Art      Associate        Associate       
#> 6 AA                     Associate        Associate       
#> 7 MS                     Master or higher Master or higher
#> 8 Masters of Art         Master or higher Master or higher

Created on 2022-01-04 by the reprex package (v2.0.1)

What @Onyambu suggested in the comments will also yield the correct results for this data.

library(tidyverse)

df %>%
  transmute(new = case_when(
    str_detect(What.I.have, "^B") ~ "Bachelor",
    str_detect(What.I.have, "^C") ~ "Certificate",
    str_detect(What.I.have, "^A") ~ "Associate",
    str_detect(What.I.have, "^M") ~ "Master or higher"
  ))
#> # A tibble: 8 × 1
#>   new             
#>   <chr>           
#> 1 Bachelor        
#> 2 Bachelor        
#> 3 Bachelor        
#> 4 Certificate     
#> 5 Associate       
#> 6 Associate       
#> 7 Master or higher
#> 8 Master or higher

Created on 2022-01-04 by the reprex package (v2.0.1)

Upvotes: 1

Related Questions