Yellow_truffle
Yellow_truffle

Reputation: 923

How to find a pattern in a string and extract it as a new column of data frame

I have a data frame as shown below:

c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
"3.2% 1ST $100000 1.1% BALANCE")

[1] "3.2% 1ST $100000 AND 1.1% BALANCE"                                
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"                                                            
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[6] "3.2% 1ST $100000 1.1% BALANCE"   

Generally, I can say that all comments are in this form: x.y% 1ST abcdef AND a.b% BALANCE (it could have a fourth number like a bonus. This data frame has 52000 rows, so definitely it's not possible to capture all sorts of comments but in general, the format I mentioned above is a good starting format). I am wondering how I can extract each number in these comments and save them in a data frame format. For example, I want to have a data frame like this:

    First%   cut-off second%  Bonus 
1    3.2     100000    1.1      NA
2    3.3     100000    1.2      3000
3    NA        NA      NA       NA
4    3.3     100000    1.2      NA
5    3.3     100000    1.2      NA
6    3.2     100000    1.1      NA

I can probably use str_subset and a pattern like this:

str_subset(data$comment, "(\\d\\.\\d)% 1ST \\$(\\d{3,8}) AND (\\d\\.\\d)% BALANCE \\w+")

but str_subset returns the string completly and I am not sure how I can save each part individually. I also checked this link Extract a regular expression match, it seems that str_extract is a good option; however, for only checking one pattern. In my case, if for instance I extract \\d\\.\\d%, it's not clear if the first number will be extracted or the second one.

Thanks

Upvotes: 1

Views: 1388

Answers (3)

Kent Orr
Kent Orr

Reputation: 504

I went with a dplyr pipe. The result isn't perfect, but it's a smoother way to fiddle with the regex matches piece by piece instead of a long string.

library(tidyverse)

data_ <- data.frame(dat = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
            "$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
            "3.2% 1ST $100000 1.1% BALANCE"))

data_ %>%
  mutate(first_perc = str_extract(dat, "\\d+\\.\\d%?"),
         cut_off = str_extract(dat, "[^\\d+\\.\\d% 1ST.]\\d+"),
         second_perc = str_extract(dat, "[^\\d+\\.\\d% 1ST?\\d+]?\\d+\\.\\d%"),
         bonus = str_extract(dat, "[^\\d+\\.\\d% 1ST?\\d+\\d+\\.\\d%\\D+]?\\d\\d+"))

I might suggest using conditionals to go back for the incomplete rows such as a case_when argument based on what did or did not get caught by the various mutated rows.

Upvotes: 1

Karthik S
Karthik S

Reputation: 11584

Does this work:

st <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
        "$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
        "3.2% 1ST $100000 1.1% BALANCE")
dat <- plyr::ldply(str_extract_all(st, '[0-9]\\.[0-9]|[0-9]{4,6}'), rbind)
names(dat) <- c('First%', 'cut-off', 'second%',  'Bonus')
dat <- dat %>% type.convert(as.is = T)
dat$`First%`[dat$`First%`>10] <- NA
dat
  First% cut-off second% Bonus
1    3.2  100000     1.1    NA
2    3.3  100000     1.2  3000
3     NA      NA      NA    NA
4    3.3  100000     1.2    NA
5    3.3  100000     1.2    NA
6    3.2  100000     1.1    NA

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388817

You can try the following :

library(tidyverse)

df %>%
  extract(col, c('First', 'cut-off', 'Second'), 
               '(\\d+.*?)% 1ST\\s*\\$(\\d+).*?(\\d+.*?)%.*?', remove = FALSE) %>%
  mutate(Bonus = str_extract(col, '\\d+(?=\\sBONUS)')) %>%
  select(-col)

#   First cut-off Second Bonus
#1   3.2  100000    1.1  <NA>
#2   3.3  100000    1.2  3000
#3  <NA>    <NA>   <NA>  <NA>
#4   3.3  100000    1.2  <NA>
#5   3.3  100000    1.2  <NA>
#6   3.2  100000    1.1  <NA>

data

df <- data.frame(col = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
                         "$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
                         "3.2% 1ST $100000 1.1% BALANCE"))

Upvotes: 4

Related Questions