Replacing number in a string with rounded number using regex

Question

I have a data frame that includes comments from different people (so it can be written in any form they want). The sample data frame is shown below (this is only a sample and my original data set has more than 50000 rows):

structure(list(comment = c("3.22%-1ST $100K/1.15% BAL", "3.25% ON 1ST $100,000/1.1625% ON BAL", 
"3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL", 
"3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
"3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL", 
"3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
"3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST $100K + 1.1625% BALANCE", 
"3.255% ON 1ST $100K & 1.1625% ON BALANCE", "3.225% ON 1ST $100,000 AND 1.16% ON BALANCE", 
"3.255% ON FIRST $100,000 AND 1.1625% ON BALANCE", "$4000", "$7,500", 
"$6,000", "$5,000", "$6000.00", "$10,000 PLUS BONUS $10,000", 
"4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST $100,000 + 1.15% ON BAL", 
"THE GREATER $3,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST $100,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA, 
-26L), class = c("tbl_df", "tbl", "data.frame"))

 1 3.22%-1ST $100K/1.15% BAL            TRUE 
 2 3.25% ON 1ST $100,000/1.1625% ON BAL TRUE 
 3 3.225% 1ST 100K/1.1625 ON BAL        TRUE 
 4 3.22% 1ST 100K/1.15% ON BAL          TRUE 
 5 3.255% 1ST 100K/1.1625% ON BAL       TRUE 
 6 3.2% 1ST 100K/1.15% ON BAL           TRUE 
 7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE 
 8 3.255% 1ST 100K/1.1625% ON BAL       TRUE 
 9 3.22% ON 1ST 100K / 1.1625% ON BAL   TRUE 
10 3.2% 1ST 100K/1.15% ON BAL           TRUE 
11 ............................         ....

As you can see this data frame has no specific format and that makes it difficult to work with. What I want to do: I want to change all number like 3.255%, 3.2%, 3.22%, 4, etc (basically, numbers in a range of 0 to 5 in each of the comments to the same format like x.yz% format.

What are the challenges? the main challenge here is that some rows don't start with numbers like $4000 or "THE GREATER $3,000 OR .5% OF SALE PRICE" which are obviously in a different format. One approach is to separate rows that start with a digit and label them with "TRUE" or "FALSE" for now. I wrote the command below(not sure if this is a good idea, though!):

df_com$a <- str_detect(df_com$comment, pattern = "^\d")

However, using this we may miss a line like "THE GREATER $3,000 OR .5% OF SALE PRICE". This line should be changed to "THE GREATER $3,000 OR .50% OF SALE PRICE".

Also, in order to replace number and round them, I followed the answer explained here: round all float numbers in a string and modified it in the form shown below to do the task:

gsubfn("(\d\w{1:3})", ~format(round(as.numeric(x), 2), nsmall=2), x)

However, this expression doesn't work.

The expected result is something in this form:

1 3.22%-1ST $100K/1.15% BAL              TRUE 
2 3.25% ON 1ST $100,000/1.16% ON BAL     TRUE 
3 3.23% 1ST 100K/1.16 ON BAL             TRUE 
4 3.22% 1ST 100K/1.15% ON BAL            TRUE 
5 3.26% 1ST 100K/1.16% ON BAL            TRUE 
6 3.20% 1ST 100K/1.15% ON BAL            TRUE 
7 3.22% ON 1ST 100K & 1.15% ON BALANCE   TRUE 
8 3.26% 1ST 100K/1.16% ON BAL            TRUE 
9 3.22% ON 1ST 100K / 1.16% ON BAL       TRUE 
10 3.20% 1ST 100K/1.15% ON BAL           TRUE 
11 ............................         ....

Any advice on how I can do this task?

TimTeaFan · Accepted Answer

We can use str_replace_all from the {stringr} package and use a function as replacement argument. See my updated answer below if it fits your needs.

library(dplyr)
library(stringr)

dat <- structure(list(comment = c("3.22%-1ST $100K/1.15% BAL", "3.25% ON 1ST $100,000/1.1625% ON BAL", 
                                  "3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL", 
                                  "3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
                                  "3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL", 
                                  "3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL", 
                                  "3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST $100K + 1.1625% BALANCE", 
                                  "3.255% ON 1ST $100K & 1.1625% ON BALANCE", "3.225% ON 1ST $100,000 AND 1.16% ON BALANCE", 
                                  "3.255% ON FIRST $100,000 AND 1.1625% ON BALANCE", "$4000", "$7,500", 
                                  "$6,000", "$5,000", "$6000.00", "$10,000 PLUS BONUS $10,000", 
                                  "4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST $100,000 + 1.15% ON BAL", 
                                  "THE GREATER $3,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST $100,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
         TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, 
         FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA, 
                                                                       -26L), class = c("tbl_df", "tbl", "data.frame"))

dat2 <- dat %>% 
  mutate(comment2 =
           str_replace_all(comment,
                           "[0-9]\.[0-9]*",
                           function(x) format(round(as.numeric(x), 2), nsmall = 2)) %>% 
           str_replace_all("\s\.[0-9]*",
                           function(x) paste0(" ", format(round(as.numeric(x), 2), nsmall = 2))) %>% 
           str_replace_all("([0-9])-",
                           "\1.00%-")
  ) 

tail(dat2)
#> # A tibble: 6 x 3
#>   comment                               a     comment2                          
#>                                                                  
#> 1 $10,000 PLUS BONUS $10,000            FALSE $10,000 PLUS BONUS $10,000        
#> 2 4-100/1.1625                          TRUE  4.00%-100/1.16                    
#> 3 3.2% 1ST 100K/1.15% ON BAL            TRUE  3.20% 1ST 100K/1.15% ON BAL       
#> 4 3.2% ON 1ST $100,000 + 1.15% ON BAL   TRUE  3.20% ON 1ST $100,000 + 1.15% ON ~
#> 5 THE GREATER $3,000 OR .5% OF SALE PR~ FALSE THE GREATER $3,000 OR 0.50% OF SA~
#> 6 **3.255% ON THE 1ST $100,000 AND 1.1~ FALSE **3.26% ON THE 1ST $100,000 AND 1~

^{Created on 2020-11-05 by the reprex package (v0.3.0)}

Replacing number in a string with rounded number using regex

Answers (2)

Related Questions