N Kevin

Reputation: 95

How can I extract text by matching letters and characters?

I have a data frame containing

Drug name
コージネイトFSバイオセット注250 250国際単位
アドベイト注射用500 500単位

I want to extract the Japanese drug names and the volumes to create two new columns:

Drug_clean   Volume
コージネイト    250
アドベイト   500

To do this, I plan to locate the letter "F" and the specific character "注" and keep the text before them, but I don't know how to do that. Can you please tell me how I can achieve it?

Thank you.

Upvotes: 1

Views: 113

Answers (2)

SEAnalyst

Reputation: 1211

With a character vector, you can use strsplit() from base R, with | separating the alternative delimiters in the pattern. From your example, you want the first element of each split, which lapply() followed by unlist() provides.

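library(dplyr)
library(stringr)  # provides str_extract()

df <- c("コージネイトFSバイオセット注250 250国際単位",
        "アドベイト注射用500 500単位")

# extract the columns
# split on "F" or "注" and keep the first piece of each split
Drug_clean <- strsplit(df, "F|注") %>% lapply(`[[`, 1) %>% unlist()
# first non-zero digit plus the two characters that follow it
Volume <- str_extract(df, "[378254619].{2}")

tibble(Drug_clean, Volume)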
# A tibble: 2 × 2
  Drug_clean   Volume
  <chr>        <chr> 
1 コージネイト 250
2 アドベイト   500

To get the Volume column as numeric, follow @Donald Seinen's excellent switch() code in the other answer.
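For comparison, here is a minimal base-R sketch of the same idea without any packages. It assumes the digits are plain ASCII, as they appear in the sample above, that every string contains at least one digit run, and that the session handles UTF-8 strings correctly. The variable name drug_name is only illustrative.

```r
# Sample data copied from the question
drug_name <- c("コージネイトFSバイオセット注250 250国際単位",
               "アドベイト注射用500 500単位")

# Drop everything from the first "F" or "注" onward
Drug_clean <- sub("(F|注).*$", "", drug_name)

# First run of ASCII digits in each string, converted to numeric.
# regmatches() silently skips strings with no match, so this
# assumes every string contains at least one digit.
Volume <- as.numeric(regmatches(drug_name, regexpr("[0-9]+", drug_name)))

data.frame(Drug_clean, Volume)
```

Unlike the fixed-width pattern, "[0-9]+" takes the whole digit run, so it would also cope with volumes that are not exactly three digits long.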

Upvotes: 3

Donald Seinen

Reputation: 4419

There are two hurdles here: the first is extracting the matches, and the second is converting the extracted Unicode digit characters to numeric. We can do this by defining a small translation function and setting the appropriate locale.

library(stringr)
library(dplyr)
tmcn::setchs() # set the locale (Chinese here; an appropriate Japanese locale might be needed instead)

df <- tibble(drug_name = c("コージネイトFSバイオセット注250 250国際単位", "アドベイト注射用500 500単位"))

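# Map each digit character to its numeric value, then rebuild the number
translate <- Vectorize(function(x){
  x <- strsplit(x, "")  # split the string into single characters
  as.list(x[[1]]) %>%
    lapply(function(ch){
      switch(ch,
             "3" = 3, "7" = 7, "8" = 8, "2" = 2, "5" = 5,
             "4" = 4, "6" = 6, "1" = 1, "9" = 9, "0" = 0, NA
      )}) %>%
    # glue the digits back together and convert the result to numeric
    paste0(collapse = "") %>% as.numeric()
})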

df %>%
  transmute(
    Drug_clean = ifelse(str_detect(drug_name, "F"),
                        str_extract(drug_name, ".*(?=F)"),
                        str_extract(drug_name, ".*(?=注)")),
    Volume = translate(str_extract(drug_name, "[3782546190].{2}"))
  )

#> # A tibble: 2 x 2
#>  Drug_clean   Volume
#>  <chr>         <dbl>
#> 1 コージネイト    250
#> 2 アドベイト      500
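If the digits in the real data are full-width Unicode digits (an assumption, since the sample shown here uses ASCII digits), an alternative to a hand-written lookup is Unicode NFKC normalization, which maps full-width digits to their ASCII equivalents. The sketch below uses stri_trans_nfkc() from the stringi package; the input vector x is hypothetical and not taken from the question.

```r
library(stringi)

# Hypothetical input: the same volumes written with full-width digits
x <- c("２５０", "５００")

# NFKC normalization replaces compatibility characters such as
# full-width digits with their ASCII equivalents, which as.numeric()
# can then parse
Volume <- as.numeric(stri_trans_nfkc(x))
Volume
```

This is a different technique from the switch() lookup, not a drop-in copy of it: it normalizes the whole string, so any other full-width characters would be converted as well.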

Upvotes: 4
