Reputation: 923
I have a data frame as shown below:
c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE")
[1] "3.2% 1ST $100000 AND 1.1% BALANCE"
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"
[6] "3.2% 1ST $100000 1.1% BALANCE"
Generally, I can say that all comments are in this form:
x.y% 1ST abcdef AND a.b% BALANCE
(it could have a fourth number like a bonus. This data frame has 52000 rows, so definitely it's not possible to capture all sorts of comments but in general, the format I mentioned above is a good starting format).
I am wondering how I can extract each number in these comments and save them in a data frame format. For example, I want to have a data frame like this:
First% cut-off second% Bonus
1 3.2 100000 1.1 NA
2 3.3 100000 1.2 3000
3 NA NA NA NA
4 3.3 100000 1.2 NA
5 3.3 100000 1.2 NA
6 3.2 100000 1.1 NA
I can probably use str_subset
and a pattern like this:
str_subset(data$comment, "(\\d\\.\\d)% 1ST \\$(\\d{3,8}) AND (\\d\\.\\d)% BALANCE \\w+")
but str_subset
returns the string completly and I am not sure how I can save each part individually. I also checked this link Extract a regular expression match, it seems that str_extract
is a good option; however, for only checking one pattern. In my case, if for instance I extract \\d\\.\\d%
, it's not clear if the first number will be extracted or the second one.
Thanks
Upvotes: 1
Views: 1388
Reputation: 504
I went with a dplyr pipe. The result isn't perfect, but it's a smoother way to fiddle with the regex matches piece by piece instead of a long string.
library(tidyverse)
data_ <- data.frame(dat = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE"))
data_ %>%
mutate(first_perc = str_extract(dat, "\\d+\\.\\d%?"),
cut_off = str_extract(dat, "[^\\d+\\.\\d% 1ST.]\\d+"),
second_perc = str_extract(dat, "[^\\d+\\.\\d% 1ST?\\d+]?\\d+\\.\\d%"),
bonus = str_extract(dat, "[^\\d+\\.\\d% 1ST?\\d+\\d+\\.\\d%\\D+]?\\d\\d+"))
I might suggest using conditionals to go back for the incomplete rows such as a case_when
argument based on what did or did not get caught by the various mutated rows.
Upvotes: 1
Reputation: 11584
Does this work:
st <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE")
dat <- plyr::ldply(str_extract_all(st, '[0-9]\\.[0-9]|[0-9]{4,6}'), rbind)
names(dat) <- c('First%', 'cut-off', 'second%', 'Bonus')
dat <- dat %>% type.convert(as.is = T)
dat$`First%`[dat$`First%`>10] <- NA
dat
First% cut-off second% Bonus
1 3.2 100000 1.1 NA
2 3.3 100000 1.2 3000
3 NA NA NA NA
4 3.3 100000 1.2 NA
5 3.3 100000 1.2 NA
6 3.2 100000 1.1 NA
Upvotes: 1
Reputation: 388817
You can try the following :
library(tidyverse)
df %>%
extract(col, c('First', 'cut-off', 'Second'),
'(\\d+.*?)% 1ST\\s*\\$(\\d+).*?(\\d+.*?)%.*?', remove = FALSE) %>%
mutate(Bonus = str_extract(col, '\\d+(?=\\sBONUS)')) %>%
select(-col)
# First cut-off Second Bonus
#1 3.2 100000 1.1 <NA>
#2 3.3 100000 1.2 3000
#3 <NA> <NA> <NA> <NA>
#4 3.3 100000 1.2 <NA>
#5 3.3 100000 1.2 <NA>
#6 3.2 100000 1.1 <NA>
data
df <- data.frame(col = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE"))
Upvotes: 4