Yellow_truffle
Yellow_truffle

Reputation: 923

Replacing number in a string with rounded number using regex

I have a data frame that includes comments from different people (so it can be written in any form they want). The sample data frame is shown below (this is only a sample and my original data set has more than 50000 rows):

structure(list(comment = c("3.22%-1ST $100K/1.15% BAL", "3.25% ON 1ST $100,000/1.1625% ON BAL", 
"3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL", 
"3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
"3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL", 
"3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
"3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST $100K + 1.1625% BALANCE", 
"3.255% ON 1ST $100K & 1.1625% ON BALANCE", "3.225% ON 1ST $100,000 AND 1.16% ON BALANCE", 
"3.255% ON FIRST $100,000 AND 1.1625% ON BALANCE", "$4000", "$7,500", 
"$6,000", "$5,000", "$6000.00", "$10,000 PLUS BONUS $10,000", 
"4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST $100,000 + 1.15% ON BAL", 
"THE GREATER $3,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST $100,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA, 
-26L), class = c("tbl_df", "tbl", "data.frame"))

 1 3.22%-1ST $100K/1.15% BAL            TRUE 
 2 3.25% ON 1ST $100,000/1.1625% ON BAL TRUE 
 3 3.225% 1ST 100K/1.1625 ON BAL        TRUE 
 4 3.22% 1ST 100K/1.15% ON BAL          TRUE 
 5 3.255% 1ST 100K/1.1625% ON BAL       TRUE 
 6 3.2% 1ST 100K/1.15% ON BAL           TRUE 
 7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE 
 8 3.255% 1ST 100K/1.1625% ON BAL       TRUE 
 9 3.22% ON 1ST 100K / 1.1625% ON BAL   TRUE 
10 3.2% 1ST 100K/1.15% ON BAL           TRUE 
11 ............................         ....

As you can see this data frame has no specific format and that makes it difficult to work with. What I want to do: I want to change all number like 3.255%, 3.2%, 3.22%, 4, etc (basically, numbers in a range of 0 to 5 in each of the comments to the same format like x.yz% format.

What are the challenges? the main challenge here is that some rows don't start with numbers like $4000 or "THE GREATER $3,000 OR .5% OF SALE PRICE" which are obviously in a different format. One approach is to separate rows that start with a digit and label them with "TRUE" or "FALSE" for now. I wrote the command below(not sure if this is a good idea, though!):

df_com$a <- str_detect(df_com$comment, pattern = "^\\d")

However, using this we may miss a line like "THE GREATER $3,000 OR .5% OF SALE PRICE". This line should be changed to "THE GREATER $3,000 OR .50% OF SALE PRICE".

Also, in order to replace number and round them, I followed the answer explained here: round all float numbers in a string and modified it in the form shown below to do the task:

gsubfn("(\\d\\w{1:3})", ~format(round(as.numeric(x), 2), nsmall=2), x)

However, this expression doesn't work.

The expected result is something in this form:

1 3.22%-1ST $100K/1.15% BAL              TRUE 
2 3.25% ON 1ST $100,000/1.16% ON BAL     TRUE 
3 3.23% 1ST 100K/1.16 ON BAL             TRUE 
4 3.22% 1ST 100K/1.15% ON BAL            TRUE 
5 3.26% 1ST 100K/1.16% ON BAL            TRUE 
6 3.20% 1ST 100K/1.15% ON BAL            TRUE 
7 3.22% ON 1ST 100K & 1.15% ON BALANCE   TRUE 
8 3.26% 1ST 100K/1.16% ON BAL            TRUE 
9 3.22% ON 1ST 100K / 1.16% ON BAL       TRUE 
10 3.20% 1ST 100K/1.15% ON BAL           TRUE 
11 ............................         ....

Any advice on how I can do this task?

Upvotes: 2

Views: 110

Answers (2)

TimTeaFan
TimTeaFan

Reputation: 18581

We can use str_replace_all from the {stringr} package and use a function as replacement argument. See my updated answer below if it fits your needs.

library(dplyr)
library(stringr)

dat <- structure(list(comment = c("3.22%-1ST $100K/1.15% BAL", "3.25% ON 1ST $100,000/1.1625% ON BAL", 
                                  "3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL", 
                                  "3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
                                  "3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL", 
                                  "3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
                                  "3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST $100K + 1.1625% BALANCE", 
                                  "3.255% ON 1ST $100K & 1.1625% ON BALANCE", "3.225% ON 1ST $100,000 AND 1.16% ON BALANCE", 
                                  "3.255% ON FIRST $100,000 AND 1.1625% ON BALANCE", "$4000", "$7,500", 
                                  "$6,000", "$5,000", "$6000.00", "$10,000 PLUS BONUS $10,000", 
                                  "4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST $100,000 + 1.15% ON BAL", 
                                  "THE GREATER $3,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST $100,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
         TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, 
         FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA, 
                                                                       -26L), class = c("tbl_df", "tbl", "data.frame"))

dat2 <- dat %>% 
  mutate(comment2 =
           str_replace_all(comment,
                           "[0-9]\\.[0-9]*",
                           function(x) format(round(as.numeric(x), 2), nsmall = 2)) %>% 
           str_replace_all("\\s\\.[0-9]*",
                           function(x) paste0(" ", format(round(as.numeric(x), 2), nsmall = 2))) %>% 
           str_replace_all("([0-9])-",
                           "\\1.00%-")
  ) 

tail(dat2)
#> # A tibble: 6 x 3
#>   comment                               a     comment2                          
#>   <chr>                                 <lgl> <chr>                             
#> 1 $10,000 PLUS BONUS $10,000            FALSE $10,000 PLUS BONUS $10,000        
#> 2 4-100/1.1625                          TRUE  4.00%-100/1.16                    
#> 3 3.2% 1ST 100K/1.15% ON BAL            TRUE  3.20% 1ST 100K/1.15% ON BAL       
#> 4 3.2% ON 1ST $100,000 + 1.15% ON BAL   TRUE  3.20% ON 1ST $100,000 + 1.15% ON ~
#> 5 THE GREATER $3,000 OR .5% OF SALE PR~ FALSE THE GREATER $3,000 OR 0.50% OF SA~
#> 6 **3.255% ON THE 1ST $100,000 AND 1.1~ FALSE **3.26% ON THE 1ST $100,000 AND 1~

Created on 2020-11-05 by the reprex package (v0.3.0)

Upvotes: 1

r2evans
r2evans

Reputation: 161110

In base R, this is a great place to use gregexpr and regmatches:

gre <- gregexpr("\\b?[0-9]*\\.[0-9]*(?=%)", df_com$comment, perl = TRUE)
str(regmatches(df_com$comment, gre))
# List of 26
#  $ : chr [1:2] "3.22" "1.15"
#  $ : chr [1:2] "3.25" "1.16"
#  $ : chr "3.23"
#  $ : chr [1:2] "3.22" "1.15"
#  $ : chr [1:2] "3.25" "1.16"
#  $ : chr [1:2] "3.20" "1.15"
#  $ : chr [1:2] "3.22" "1.15"
#  $ : chr [1:2] "3.25" "1.16"
#  $ : chr [1:2] "3.22" "1.16"
#  $ : chr [1:2] "3.20" "1.15"
#  $ : chr [1:2] "3.20" "1.15"
#  $ : chr [1:2] "3.20" "1.16"
#  $ : chr [1:2] "3.25" "1.16"
#  $ : chr [1:2] "3.23" "1.16"
#  $ : chr [1:2] "3.25" "1.16"
#  $ : chr(0) 
#  $ : chr(0) 
#  $ : chr(0) 
#  $ : chr(0) 
#  $ : chr(0) 
#  $ : chr(0) 
#  $ : chr(0) 
#  $ : chr [1:2] "3.20" "1.15"
#  $ : chr [1:2] "3.20" "1.15"
#  $ : chr(0) 
#  $ : chr [1:2] "3.25" "1.16"
regmatches(df_com$comment, gre) <-
  lapply(regmatches(df_com$comment, gre), function(nums) {
    format(round(as.numeric(nums), 2), nsmall=2)
  })

The result:

df_com
# # A tibble: 26 x 2
#    comment                              a    
#    <chr>                                <lgl>
#  1 3.22%-1ST $100K/1.15% BAL            TRUE 
#  2 3.25% ON 1ST $100,000/1.16% ON BAL   TRUE 
#  3 3.23% 1ST 100K/1.1625 ON BAL         TRUE 
#  4 3.22% 1ST 100K/1.15% ON BAL          TRUE 
#  5 3.25% 1ST 100K/1.16% ON BAL          TRUE 
#  6 3.20% 1ST 100K/1.15% ON BAL          TRUE 
#  7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE 
#  8 3.25% 1ST 100K/1.16% ON BAL          TRUE 
#  9 3.22% ON 1ST 100K / 1.16% ON BAL     TRUE 
# 10 3.20% 1ST 100K/1.15% ON BAL          TRUE 
# # ... with 16 more rows

I used the literal of x.yz%, which explains why there is a single 1.1625 unchanged in row 3.

Upvotes: 3

Related Questions