Reputation: 923
I have a data frame that includes comments from different people (so it can be written in any form they want). The sample data frame is shown below (this is only a sample and my original data set has more than 50000 rows):
structure(list(comment = c("3.22%-1ST $100K/1.15% BAL", "3.25% ON 1ST $100,000/1.1625% ON BAL",
"3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL",
"3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL",
"3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST $100K + 1.1625% BALANCE",
"3.255% ON 1ST $100K & 1.1625% ON BALANCE", "3.225% ON 1ST $100,000 AND 1.16% ON BALANCE",
"3.255% ON FIRST $100,000 AND 1.1625% ON BALANCE", "$4000", "$7,500",
"$6,000", "$5,000", "$6000.00", "$10,000 PLUS BONUS $10,000",
"4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST $100,000 + 1.15% ON BAL",
"THE GREATER $3,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST $100,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA,
-26L), class = c("tbl_df", "tbl", "data.frame"))
1 3.22%-1ST $100K/1.15% BAL TRUE
2 3.25% ON 1ST $100,000/1.1625% ON BAL TRUE
3 3.225% 1ST 100K/1.1625 ON BAL TRUE
4 3.22% 1ST 100K/1.15% ON BAL TRUE
5 3.255% 1ST 100K/1.1625% ON BAL TRUE
6 3.2% 1ST 100K/1.15% ON BAL TRUE
7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE
8 3.255% 1ST 100K/1.1625% ON BAL TRUE
9 3.22% ON 1ST 100K / 1.1625% ON BAL TRUE
10 3.2% 1ST 100K/1.15% ON BAL TRUE
11 ............................ ....
As you can see this data frame has no specific format and that makes it difficult to work with.
What I want to do: I want to change all number like 3.255%, 3.2%, 3.22%, 4, etc (basically, numbers in a range of 0 to 5 in each of the comments to the same format like x.yz%
format.
What are the challenges? the main challenge here is that some rows don't start with numbers like $4000 or "THE GREATER $3,000 OR .5% OF SALE PRICE" which are obviously in a different format. One approach is to separate rows that start with a digit and label them with "TRUE" or "FALSE" for now. I wrote the command below(not sure if this is a good idea, though!):
df_com$a <- str_detect(df_com$comment, pattern = "^\\d")
However, using this we may miss a line like "THE GREATER $3,000 OR .5% OF SALE PRICE". This line should be changed to "THE GREATER $3,000 OR .50% OF SALE PRICE".
Also, in order to replace number and round them, I followed the answer explained here: round all float numbers in a string and modified it in the form shown below to do the task:
gsubfn("(\\d\\w{1:3})", ~format(round(as.numeric(x), 2), nsmall=2), x)
However, this expression doesn't work.
The expected result is something in this form:
1 3.22%-1ST $100K/1.15% BAL TRUE
2 3.25% ON 1ST $100,000/1.16% ON BAL TRUE
3 3.23% 1ST 100K/1.16 ON BAL TRUE
4 3.22% 1ST 100K/1.15% ON BAL TRUE
5 3.26% 1ST 100K/1.16% ON BAL TRUE
6 3.20% 1ST 100K/1.15% ON BAL TRUE
7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE
8 3.26% 1ST 100K/1.16% ON BAL TRUE
9 3.22% ON 1ST 100K / 1.16% ON BAL TRUE
10 3.20% 1ST 100K/1.15% ON BAL TRUE
11 ............................ ....
Any advice on how I can do this task?
Upvotes: 2
Views: 110
Reputation: 18581
We can use str_replace_all
from the {stringr} package and use a function as replacement
argument. See my updated answer below if it fits your needs.
library(dplyr)
library(stringr)
dat <- structure(list(comment = c("3.22%-1ST $100K/1.15% BAL", "3.25% ON 1ST $100,000/1.1625% ON BAL",
"3.225% 1ST 100K/1.1625 ON BAL", "3.22% 1ST 100K/1.15% ON BAL",
"3.255% 1ST 100K/1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.22% ON 1ST 100K & 1.15% ON BALANCE", "3.255% 1ST 100K/1.1625% ON BAL",
"3.22% ON 1ST 100K / 1.1625% ON BAL", "3.2% 1ST 100K/1.15% ON BAL",
"3.2% 1ST 100K/1.15% ON BAL", "3.2% 1ST $100K + 1.1625% BALANCE",
"3.255% ON 1ST $100K & 1.1625% ON BALANCE", "3.225% ON 1ST $100,000 AND 1.16% ON BALANCE",
"3.255% ON FIRST $100,000 AND 1.1625% ON BALANCE", "$4000", "$7,500",
"$6,000", "$5,000", "$6000.00", "$10,000 PLUS BONUS $10,000",
"4-100/1.1625", "3.2% 1ST 100K/1.15% ON BAL", "3.2% ON 1ST $100,000 + 1.15% ON BAL",
"THE GREATER $3,000 OR .5% OF SALE PRICE", "**3.255% ON THE 1ST $100,000 AND 1.1625% ON THE BALANCE"
), a = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)), row.names = c(NA,
-26L), class = c("tbl_df", "tbl", "data.frame"))
dat2 <- dat %>%
mutate(comment2 =
str_replace_all(comment,
"[0-9]\\.[0-9]*",
function(x) format(round(as.numeric(x), 2), nsmall = 2)) %>%
str_replace_all("\\s\\.[0-9]*",
function(x) paste0(" ", format(round(as.numeric(x), 2), nsmall = 2))) %>%
str_replace_all("([0-9])-",
"\\1.00%-")
)
tail(dat2)
#> # A tibble: 6 x 3
#> comment a comment2
#> <chr> <lgl> <chr>
#> 1 $10,000 PLUS BONUS $10,000 FALSE $10,000 PLUS BONUS $10,000
#> 2 4-100/1.1625 TRUE 4.00%-100/1.16
#> 3 3.2% 1ST 100K/1.15% ON BAL TRUE 3.20% 1ST 100K/1.15% ON BAL
#> 4 3.2% ON 1ST $100,000 + 1.15% ON BAL TRUE 3.20% ON 1ST $100,000 + 1.15% ON ~
#> 5 THE GREATER $3,000 OR .5% OF SALE PR~ FALSE THE GREATER $3,000 OR 0.50% OF SA~
#> 6 **3.255% ON THE 1ST $100,000 AND 1.1~ FALSE **3.26% ON THE 1ST $100,000 AND 1~
Created on 2020-11-05 by the reprex package (v0.3.0)
Upvotes: 1
Reputation: 161110
In base R, this is a great place to use gregexpr
and regmatches
:
gre <- gregexpr("\\b?[0-9]*\\.[0-9]*(?=%)", df_com$comment, perl = TRUE)
str(regmatches(df_com$comment, gre))
# List of 26
# $ : chr [1:2] "3.22" "1.15"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr "3.23"
# $ : chr [1:2] "3.22" "1.15"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.22" "1.15"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr [1:2] "3.22" "1.16"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.20" "1.16"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr [1:2] "3.23" "1.16"
# $ : chr [1:2] "3.25" "1.16"
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr(0)
# $ : chr [1:2] "3.20" "1.15"
# $ : chr [1:2] "3.20" "1.15"
# $ : chr(0)
# $ : chr [1:2] "3.25" "1.16"
regmatches(df_com$comment, gre) <-
lapply(regmatches(df_com$comment, gre), function(nums) {
format(round(as.numeric(nums), 2), nsmall=2)
})
The result:
df_com
# # A tibble: 26 x 2
# comment a
# <chr> <lgl>
# 1 3.22%-1ST $100K/1.15% BAL TRUE
# 2 3.25% ON 1ST $100,000/1.16% ON BAL TRUE
# 3 3.23% 1ST 100K/1.1625 ON BAL TRUE
# 4 3.22% 1ST 100K/1.15% ON BAL TRUE
# 5 3.25% 1ST 100K/1.16% ON BAL TRUE
# 6 3.20% 1ST 100K/1.15% ON BAL TRUE
# 7 3.22% ON 1ST 100K & 1.15% ON BALANCE TRUE
# 8 3.25% 1ST 100K/1.16% ON BAL TRUE
# 9 3.22% ON 1ST 100K / 1.16% ON BAL TRUE
# 10 3.20% 1ST 100K/1.15% ON BAL TRUE
# # ... with 16 more rows
I used the literal of x.yz%
, which explains why there is a single 1.1625
unchanged in row 3.
Upvotes: 3