Yellow_truffle

Reputation: 923

How to use `regex` to add % sign to a string only for specific strings that don't have it

I am trying to use regex to add a % sign to numbers in the range 0 to 10 in my data frame. The data frame looks like this:

structure(list(comment = c("3.22%-1ST $100000 AND 1.15% BALANCE", 
"3.25%  1ST $100000 AND 1.16%  BALANCE", "3.22% 1ST 100000 AND 1.16  BALANCE", 
"3.22% 1ST 100000 AND 1.15%  BALANCE", "3.26-100 AND 1.16", "3.26-100 AND 1.16"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

1 3.22%-1ST $100000 AND 1.15% BALANCE  
2 3.25%  1ST $100000 AND 1.16%  BALANCE
3 3.22% 1ST 100000 AND 1.16  BALANCE   
4 3.22% 1ST 100000 AND 1.15%  BALANCE  
5 3.26-100 AND 1.16                    
6 3.26-100 AND 1.16

So basically, I only want to add % to 1.16 in row 3 and 3.26 and 1.16 in rows 5 and 6. I wrote the code shown below:

tt$modified <- gsub("([0-9]\\.[0-9][0-9])", "\\1%", tt$comment)

but, as shown below, this adds % to every number, including those that already have one:

  comment                               modified                               
  <chr>                                 <chr>                                  
1 3.22%-1ST $100000 AND 1.15% BALANCE   3.22%%-1ST $100000 AND 1.15%% BALANCE  
2 3.25%  1ST $100000 AND 1.16%  BALANCE 3.25%%  1ST $100000 AND 1.16%%  BALANCE
3 3.22% 1ST 100000 AND 1.16  BALANCE    3.22%% 1ST 100000 AND 1.16%  BALANCE   
4 3.22% 1ST 100000 AND 1.15%  BALANCE   3.22%% 1ST 100000 AND 1.15%%  BALANCE  
5 3.26-100 AND 1.16                     3.26%-100 AND 1.16%                    
6 3.26-100 AND 1.16                     3.26%-100 AND 1.16% 

How can I fix this issue?

Upvotes: 3

Views: 58

Answers (2)

Tim Biegeleisen

Reputation: 521914

You may make judicious use of lookarounds here to ensure that the percent sign only gets added where you want it:

df$comment <- gsub("\\b(\\d+\\.\\d+)\\b(?![%.])", "\\1%", df$comment, perl=TRUE)
df

                                comment
1   3.22%-1ST $100000 AND 1.15% BALANCE
2 3.25%  1ST $100000 AND 1.16%  BALANCE
3   3.22% 1ST 100000 AND 1.16%  BALANCE
4   3.22% 1ST 100000 AND 1.15%  BALANCE
5                   3.26%-100 AND 1.16%
6                   3.26%-100 AND 1.16%

Note that I assume here that you only want to target decimal numbers. If you might also want to target integers, we would need more information about the context of all replacements.

The regex pattern says to:

\b            match a word boundary (start of the number)
(             capture
    \d+\.\d+  a number with a decimal component
)             end capture
\b            word boundary
(?![%.])      assert that what follows is NOT % or .

Note that the final negative lookahead prevents replacements on numbers that already have a %, and on numbers that are immediately followed by another dot.
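As a quick check of the lookahead pattern, here is a minimal sketch that runs it on three rows copied from the question. It assumes a plain data.frame named tt (the question uses a tibble) and writes to a new modified column rather than overwriting comment; both choices are only for illustration.

```r
# Three representative rows from the question's data (plain data.frame, not a tibble)
tt <- data.frame(
  comment = c(
    "3.22%-1ST $100000 AND 1.15% BALANCE",
    "3.22% 1ST 100000 AND 1.16  BALANCE",
    "3.26-100 AND 1.16"
  ),
  stringsAsFactors = FALSE
)

# The negative lookahead needs the PCRE engine, hence perl = TRUE
pattern <- "\\b(\\d+\\.\\d+)\\b(?![%.])"
tt$modified <- gsub(pattern, "\\1%", tt$comment, perl = TRUE)

print(tt$modified)
```

The first row should come back unchanged, while the bare 1.16, 3.26 and 1.16 in the other two rows each gain a single %.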

Upvotes: 3

Ronak Shah

Reputation: 389135

You can make the % sign optional in the match, so that an existing % is consumed and replaced instead of being doubled.

tt$modified <- gsub("([0-9]\\.[0-9][0-9])%?", "\\1%", tt$comment)
tt

# A tibble: 6 x 2
#  comment                               modified                             
#  <chr>                                 <chr>                                
#1 3.22%-1ST $100000 AND 1.15% BALANCE   3.22%-1ST $100000 AND 1.15% BALANCE  
#2 3.25%  1ST $100000 AND 1.16%  BALANCE 3.25%  1ST $100000 AND 1.16%  BALANCE
#3 3.22% 1ST 100000 AND 1.16  BALANCE    3.22% 1ST 100000 AND 1.16%  BALANCE  
#4 3.22% 1ST 100000 AND 1.15%  BALANCE   3.22% 1ST 100000 AND 1.15%  BALANCE  
#5 3.26-100 AND 1.16                     3.26%-100 AND 1.16%                  
#6 3.26-100 AND 1.16                     3.26%-100 AND 1.16%          
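The two patterns in this thread agree on the question's data, but they can differ on other input. The sketch below compares them; the value "13.225" is a hypothetical input that does not appear in the question, included only to show how the missing word boundaries change the result.

```r
optional_pct <- "([0-9]\\.[0-9][0-9])%?"          # optional-% pattern from this answer
lookahead    <- "\\b(\\d+\\.\\d+)\\b(?![%.])"     # lookahead pattern from the other answer

# Fragments taken from the question: both patterns give the same result
x <- c("1.15%", "1.16", "3.26-100")
res_optional  <- gsub(optional_pct, "\\1%", x)
res_lookahead <- gsub(lookahead, "\\1%", x, perl = TRUE)

# Hypothetical input (not in the question): more digits on both sides of the dot
y <- "13.225"
y_optional  <- gsub(optional_pct, "\\1%", y)              # "13.22%5"
y_lookahead <- gsub(lookahead, "\\1%", y, perl = TRUE)    # "13.225%"

print(rbind(res_optional, res_lookahead))
print(c(y_optional, y_lookahead))
```

If every number really has one leading digit and two decimals, as in the question, the shorter pattern is enough; otherwise the word boundaries in the lookahead version keep the % from landing in the middle of a number.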

Upvotes: 2
