Reputation: 923
Is there any way to extract all numbers in a string as a vector? I have a large dataset that doesn't follow any specific pattern, so using the extract + regex pattern approach won't necessarily extract all numbers. For example, for each row of the data frame shown below:
x <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2 - $100000")
[1] "3.2% 1ST $100000 AND 1.1% BALANCE"
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"
[6] "3.2 - $100000"
I want to have an output like:
[1] "3.2 100000 1.1"
[2] "3.3 100000 1.2 3000"
[3] "4000"
[4] "3.3 100000 1.2 "
[5] "3.3 100000 1.2 "
[6] "3.2 100000 "
I looked through some resources and found this link: https://statisticsglobe.com/extract-numbers-from-character-string-vector-in-r
regmatches(x, gregexpr("[[:digit:]]+", x))
It seems that the above function works, but it can't handle all kinds of numbers at once. I understand that "[[:digit:]]+" only looks for integers, but how can we change it so that it covers all kinds of numbers?
Upvotes: 2
Views: 257
Reputation: 389275
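You can use a regex with a negative lookahead, which skips digits that are immediately followed by an uppercase letter (such as the 1 in 1ST):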
stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])')
#[[1]]
#[1] "3.2" "100000" "1.1"
#[[2]]
#[1] "3.3" "100000" "1.2" "3000"
#[[3]]
#[1] "4000"
#[[4]]
#[1] "3.3" "100000" "1.2"
#[[5]]
#[1] "3.3" "100000" "1.2"
#[[6]]
#[1] "3.2" "100000"
If you want the output as one string per element:
sapply(stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])'), paste, collapse = ' ')
#[1] "3.2 100000 1.1" "3.3 100000 1.2 3000" "4000"
#[4] "3.3 100000 1.2" "3.3 100000 1.2" "3.2 100000"
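If actual numeric vectors are needed rather than strings, the matches can be converted with as.numeric. A minimal sketch, assuming the same x as in the question; it uses base R's regmatches/gregexpr with perl = TRUE (required for the lookahead) instead of stringr, and the names matches and nums are only illustrative:

```r
# Sample data from the question
x <- c("3.2% 1ST $100000 AND 1.1% BALANCE",
       "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
       "$4000",
       "3.3% 1ST $100000 AND 1.2% BALANCE",
       "3.3% 1ST $100000 AND 1.2% BALANCE",
       "3.2 - $100000")

# Same pattern as above; perl = TRUE enables the (?!...) lookahead in base R
matches <- regmatches(x, gregexpr("\\d+(\\.\\d+)?(?![A-Z])", x, perl = TRUE))

# Convert each character vector of matches to a numeric vector
nums <- lapply(matches, as.numeric)

# One numeric vector per input string, e.g. the four numbers of the second row
nums[[2]]
```

The result is a list because the rows contain different numbers of matches; lengths(nums) shows how many were found per row.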
Upvotes: 1
Reputation: 3152
Akrun's answer is perfect, but just to add another solution, here is one that uses a package I recently found for building regular expression patterns.
library(stringr)
library(rebus)
library(magrittr)
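# Digits, optionally followed by a dot and more digits
pattern <- one_or_more(DIGIT) %R% optional(DOT) %R% optional(one_or_more(DIGIT))
# Drop the ordinal "1ST" first so that its digit is not picked up
str_remove(x, "1ST") %>%
  str_match_all(pattern = pattern) %>%
  lapply(function(x) paste(as.vector(x), collapse = " ")) %>%
  unlist()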
Upvotes: 3
Reputation: 887901
We also need to include the . in the matching pattern. The \\b word boundaries keep the 1 in 1ST from being matched:
sapply(regmatches(x, gregexpr("\\b[[:digit:].]+\\b", x)), paste, collapse= ' ')
#[1] "3.2 100000 1.1"
#[2] "3.3 100000 1.2 3000"
#[3] "4000"
#[4] "3.3 100000 1.2"
#[5] "3.3 100000 1.2"
#[6] "3.2 100000"
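Because the dot is part of the character class, this pattern also accepts tokens with more than one dot, such as 1.2.3, which are not valid numbers. If the data may contain such tokens, one option is to filter the matches afterwards. A minimal sketch; y is a made-up input and keep_valid is a hypothetical helper, neither comes from the question:

```r
# Made-up input: the first element contains a token with two dots
y <- c("VER 1.2.3", "3.2% 1ST $100000")

# Extract candidate tokens with the pattern from this answer
tokens <- regmatches(y, gregexpr("\\b[[:digit:].]+\\b", y))

# Hypothetical helper: keep only tokens shaped like digits with an
# optional single decimal part (e.g. "4000" or "3.2")
keep_valid <- function(tok) {
  tok[grepl("^[[:digit:]]+(\\.[[:digit:]]+)?$", tok)]
}

valid <- lapply(tokens, keep_valid)

# Collapse to one string per element, as in the answer above
sapply(valid, paste, collapse = " ")
```

Filtering after extraction keeps the original pattern unchanged and only discards tokens that as.numeric could not parse anyway.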
Upvotes: 3