Reputation: 173
I'm trying to write an expression that extracts numbers from a string with corresponding currency signs and potential amount abbreviations (m or k):
library(stringr)
text <- "$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m"
str_extract(text, "\\$(\\d+)[a-z]+") # solution_1
str_extract(text, "\\$(\\d+)+") # solution_2
Desired output:
"$10000 $10,000 $5m $50m $50.2m $50,2m"
The problem is that solution_1 extracts only "$5m" and solution_2 extracts only "$10000".
UPDATE: @Tim Biegeleisen provided a great solution. I am also trying to get rid of a trailing period, e.g. to get $50m from "$50m. and...".
text <- "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m"
m <- gregexpr("\\$[0-9.,]+?[mbt]?(?=(?:, | |$))", text, perl=TRUE)
regmatches(text, m)
Upvotes: 0
Views: 308
Reputation: 887391
Maybe we could use gsub, since the OP's expected output is shown as a single string:
text <- c("$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m",
          "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m")
gsub("\\b[A-Za-z]+,?|[,.](\\s)", "\\1", text)
#[1] "$10000 $10,000 $5m $50m $50.2m $50,2m"
#[2] "$5 $10,000 $5m $50m $50.2m $50,2m"
Upvotes: 0
Reputation: 5893
You could also split the string on spaces and keep only the tokens that look like amounts:
txt = unlist(strsplit(text, split = " "))
txt[grep("\\$\\d+((,|\\.)?)(\\d*)?(m)?", txt)]
[1] "$10000" "$10,000" "$5m" "$50m" "$50.2m" "$50,2m"
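If the input contains trailing punctuation, as in the second example string from the question, the split tokens will carry it along ("$5," or "$50m."). A minimal sketch of one way to handle that, assuming that a trailing comma or period is never part of the amount; the names tokens and amounts are only illustrative:

```r
text <- "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m"

# Split on single spaces, as in the approach above
tokens <- unlist(strsplit(text, split = " ", fixed = TRUE))

# Keep only the tokens that start with "$" followed by a digit
amounts <- tokens[grepl("^\\$[0-9]", tokens)]

# Assumption: any run of "," or "." at the very end is punctuation, so drop it
amounts <- sub("[.,]+$", "", amounts)
amounts
```

Separators inside an amount, such as the comma in "$10,000", are left alone because sub only touches the end of each token.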
Upvotes: 0
Reputation: 521804
Try using gregexpr with regmatches:
text <- "$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m"
m <- gregexpr("\\$[0-9.,]+[mbt]?", text)
regmatches(text, m)
[[1]]
[1] "$10000" "$10,000" "$5m" "$50m" "$50.2m" "$50,2m"
I am assuming that only digits, commas, and decimal points compose a given amount string. I also assume that the amount might end in m, b, or t (for million, billion, trillion).
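For the case raised in the question's update, where an amount may be followed by a comma or period, one possible variation is to require digits after every separator so that trailing punctuation is never consumed. This is only a sketch under that assumption, not the pattern above; it also adds k to the suffix set because the question mentions it:

```r
text <- "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m"

# "$", then digits, then zero or more groups of (separator + digits),
# then an optional k/m/b/t suffix. A "," or "." that is not followed by
# a digit is left out of the match.
pattern <- "\\$[0-9]+(?:[.,][0-9]+)*[kmbt]?"

m <- gregexpr(pattern, text, perl = TRUE)
amounts <- regmatches(text, m)[[1]]
amounts

# Collapse to a single string, matching the form of the desired output
paste(amounts, collapse = " ")
```

The trade-off is that a malformed amount such as "$50." would be returned as "$50", which seems to be what the update asks for.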
Upvotes: 3