How can I extract text by matching letters and characters?

Question

I have a data frame containing the following column:

Drug name
コージネイトＦＳバイオセット注２５０　２５０国際単位
アドベイト注射用５００　５００単位

I want to extract the Japanese drug names and the volume to create two new columns:

Drug_clean   Volume
コージネイト    250
アドベイト　　　500

To do this, I plan to match the letter F and the specific character "注", but I don't know how to do that. Could you please tell me how I can achieve this?

Thank you.

Donald Seinen · Accepted Answer

There are two hurdles here: extracting the matches, and converting the full-width (Unicode) digits to numbers. We can handle both by defining a small translation function and setting an appropriate locale.

library(stringr)
library(dplyr) # load first: dplyr re-exports tibble()
tmcn::setchs() # to set locale (Chinese here, might need an appropriate Japanese instead)

df <- tibble(drug_name = c("コージネイトＦＳバイオセット注２５０　２５０国際単位", "アドベイト注射用５００　５００単位"))

# Convert a string of full-width digits (e.g. "２５０") to a number (250)
translate <- Vectorize(function(x){
  x <- strsplit(x, "")
  as.list(x[[1]]) %>%
    lapply(function(x){
      switch(x,
             "３" = 3, "７" = 7, "８" = 8, "２" = 2, "５" = 5,
             "４" = 4, "６" = 6, "１" = 1, "９" = 9, "０" = 0, NA
      )}) %>%
    paste0(collapse = "") %>% as.numeric()
})

df %>%
  transmute(
    # keep everything before "Ｆ" if present, otherwise everything before "注"
    Drug_clean = ifelse(str_detect(drug_name, "Ｆ"),
                        str_extract(drug_name, ".*(?=Ｆ)"),
                        str_extract(drug_name, ".*(?=注)")),
    # take the first run of full-width digits, whatever its length
    Volume = translate(str_extract(drug_name, "[３７８２５４６１９０]+"))
  )

#> # A tibble: 2 x 2
#>   Drug_clean   Volume
#>   <chr>         <dbl>
#> 1 コージネイト    250
#> 2 アドベイト      500
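
The same idea can also be sketched in base R without changing the locale. This is only an illustration, not the method used above: it assumes the R session already runs in a UTF-8 locale, that the name always ends at the first "Ｆ" or "注", and that the volume is the first run of full-width digits. The variable names (`drug_clean`, `digits_fw`, `result`) are made up for this example.

```r
# Sample data from the question (base R only, no packages).
drug_name <- c("コージネイトＦＳバイオセット注２５０　２５０国際単位",
               "アドベイト注射用５００　５００単位")

# Drop everything from the first "Ｆ" or "注" onwards, keeping the name.
drug_clean <- sub("[Ｆ注].*$", "", drug_name)

# Locate the first run of full-width digits in each string.
m <- regexpr("[０１２３４５６７８９]+", drug_name)

# regmatches() silently drops non-matches, so fill a vector of NAs instead.
digits_fw <- rep(NA_character_, length(drug_name))
digits_fw[m > 0] <- regmatches(drug_name, m)

# Map full-width digits to ASCII digits one-to-one, then convert to numeric.
volume <- as.numeric(chartr("０１２３４５６７８９", "0123456789", digits_fw))

result <- data.frame(Drug_clean = drug_clean, Volume = volume,
                     stringsAsFactors = FALSE)
print(result)
```

`chartr()` replaces each character in its first argument with the character at the same position in its second, so no lookup function is needed for the digits. Rows without any full-width digits end up with `NA` in `Volume`.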
