Reputation: 923
I am trying to add a % sign to numbers in the range 0 to 10 in my data frame using regex. The data frame looks like the one shown below:
structure(list(comment = c("3.22%-1ST $100000 AND 1.15% BALANCE",
"3.25% 1ST $100000 AND 1.16% BALANCE", "3.22% 1ST 100000 AND 1.16 BALANCE",
"3.22% 1ST 100000 AND 1.15% BALANCE", "3.26-100 AND 1.16", "3.26-100 AND 1.16"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
1 3.22%-1ST $100000 AND 1.15% BALANCE
2 3.25% 1ST $100000 AND 1.16% BALANCE
3 3.22% 1ST 100000 AND 1.16 BALANCE
4 3.22% 1ST 100000 AND 1.15% BALANCE
5 3.26-100 AND 1.16
6 3.26-100 AND 1.16
So basically, I only want to add % to 1.16 in row 3 and to 3.26 and 1.16 in rows 5 and 6. I wrote the code shown below:
tt$modified <- gsub("([0-9]\\.[0-9][0-9])", "\\1%", tt$comment)
but, as shown below, this adds % to every number, including those that already have one:
comment modified
<chr> <chr>
1 3.22%-1ST $100000 AND 1.15% BALANCE 3.22%%-1ST $100000 AND 1.15%% BALANCE
2 3.25% 1ST $100000 AND 1.16% BALANCE 3.25%% 1ST $100000 AND 1.16%% BALANCE
3 3.22% 1ST 100000 AND 1.16 BALANCE 3.22%% 1ST 100000 AND 1.16% BALANCE
4 3.22% 1ST 100000 AND 1.15% BALANCE 3.22%% 1ST 100000 AND 1.15%% BALANCE
5 3.26-100 AND 1.16 3.26%-100 AND 1.16%
6 3.26-100 AND 1.16 3.26%-100 AND 1.16%
How can I fix this issue?
Upvotes: 3
Views: 58
Reputation: 521914
You may make judicious use of lookarounds here to ensure that the percent sign only gets added where you want it:
df$comment <- gsub("\\b(\\d+\\.\\d+)\\b(?![%.])", "\\1%", df$comment, perl=TRUE)
df
comment
1 3.22%-1ST $100000 AND 1.15% BALANCE
2 3.25% 1ST $100000 AND 1.16% BALANCE
3 3.22% 1ST 100000 AND 1.16% BALANCE
4 3.22% 1ST 100000 AND 1.15% BALANCE
5 3.26%-100 AND 1.16%
6 3.26%-100 AND 1.16%
Note that I assume here that you only want to target decimal numbers. If you might also want to target integers, we would need more information about the context of the replacements.
The regex pattern says to:
\b match a word boundary (start of the number)
( capture
\d+\.\d+ a number with a decimal component
) end capture
\b word boundary
(?![%.]) assert that what follows is NOT % or .
Note that the final negative lookahead prevents replacements from being made on numbers which already have %, or on the integer component of a decimal number.
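To make the breakdown above easy to check, here is a small self-contained sketch that runs the same pattern on a few strings copied from the question. The vector names (`x`, `y`, `any_decimal`, `single_digit`) are made up for illustration. The second pattern is an assumed variant, not part of the original answer: it allows only one digit before the decimal point, which may be closer to the "0 to 10" range mentioned in the question.

```r
# Sample strings copied from the question's data
x <- c("3.22%-1ST $100000 AND 1.15% BALANCE",
       "3.22% 1ST 100000 AND 1.16 BALANCE",
       "3.26-100 AND 1.16")

# perl = TRUE is needed: R's default regex engine (TRE) does not support lookaheads
any_decimal <- gsub("\\b(\\d+\\.\\d+)\\b(?![%.])", "\\1%", x, perl = TRUE)
print(any_decimal)

# Hypothetical variant: exactly one digit before the decimal point,
# so a value such as 12.50 is left untouched
y <- "12.50 AND 1.16"
single_digit <- gsub("\\b(\\d\\.\\d+)\\b(?![%.])", "\\1%", y, perl = TRUE)
print(single_digit)
```

With `\\d+` the value 12.50 would also receive a percent sign, so which variant fits depends on whether numbers of 10 or more can appear in the data.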
Upvotes: 3
Reputation: 389135
You can try an optional % sign in the pattern, so that an existing % is consumed by the match and replaced with a single % rather than doubled.
tt$modified <- gsub("([0-9]\\.[0-9][0-9])%?", "\\1%", tt$comment)
tt
# A tibble: 6 x 2
# comment modified
# <chr> <chr>
#1 3.22%-1ST $100000 AND 1.15% BALANCE 3.22%-1ST $100000 AND 1.15% BALANCE
#2 3.25% 1ST $100000 AND 1.16% BALANCE 3.25% 1ST $100000 AND 1.16% BALANCE
#3 3.22% 1ST 100000 AND 1.16 BALANCE 3.22% 1ST 100000 AND 1.16% BALANCE
#4 3.22% 1ST 100000 AND 1.15% BALANCE 3.22% 1ST 100000 AND 1.15% BALANCE
#5 3.26-100 AND 1.16 3.26%-100 AND 1.16%
#6 3.26-100 AND 1.16 3.26%-100 AND 1.16%
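One caveat worth checking against your own data: the pattern above has no boundaries, so it can also match part of a longer number. The sketch below compares it with an assumed variant that adds word boundaries. The input string `z` and the names `loose` and `strict` are invented for illustration and do not come from the question.

```r
# Made-up input containing a three-decimal value and a value above 10
z <- "3.225 AND 12.50 AND 1.16 AND 2.50%"

# Pattern from the answer: no boundaries, so it can match inside longer numbers
loose <- gsub("([0-9]\\.[0-9][0-9])%?", "\\1%", z)
print(loose)

# Assumed variant: \\b on both sides requires the number to stand alone;
# the optional % is still consumed so it is never doubled
strict <- gsub("\\b([0-9]\\.[0-9][0-9])\\b%?", "\\1%", z, perl = TRUE)
print(strict)
```

Here `loose` turns 3.225 into 3.22%5 and 12.50 into 12.50%, while `strict` leaves both alone. If the comments only ever contain two-decimal values below 10, as in the sample data, the simpler pattern gives the same result.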
Upvotes: 2