Yellow_truffle

Reputation: 923

How to extract all numbers in a string as a vector

Is there any way to extract all numbers in a string as a vector? I have a large dataset that doesn't follow any specific pattern, so using extract with a single regex pattern won't necessarily capture all the numbers. For example, for each row of the data frame shown below:

x <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
"3.2 - $100000")

[1] "3.2% 1ST $100000 AND 1.1% BALANCE"                                
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"                                                            
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[6] "3.2 - $100000"   

I want to have an output like:

[1] "3.2 100000 1.1"                                
[2] "3.3 100000 1.2 3000"
[3] "4000"                                                            
[4] "3.3 100000 1.2 "                                
[5] "3.3 100000 1.2 "                                
[6] "3.2 100000 "   

I had a look at some resources and found this link: https://statisticsglobe.com/extract-numbers-from-character-string-vector-in-r

regmatches(x, gregexpr("[[:digit:]]+", x))

The function above seems to work, but it cannot handle all sorts of numbers at once. I understand that "[[:digit:]]+" only looks for integers, but how can we change it so that it covers all sorts of numbers (decimals as well as integers)?
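To make the limitation concrete, here is a minimal sketch that applies the integer-only pattern to the first string from the example. The variable names s and ints are only illustrative.

```r
# First example string from the question
s <- "3.2% 1ST $100000 AND 1.1% BALANCE"

# "[[:digit:]]+" matches runs of digits only, so each decimal is split
# at the dot and the 1 in "1ST" is picked up as well
ints <- regmatches(s, gregexpr("[[:digit:]]+", s))[[1]]
print(ints)
# [1] "3"      "2"      "1"      "100000" "1"      "1"
```

So two things need handling: the decimal point has to be part of the match, and digits attached to letters (as in "1ST") have to be excluded.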

Upvotes: 2

Views: 257

Answers (3)

Ronak Shah

Reputation: 389275

You can use a negative lookahead regex:

stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])')

#[[1]]
#[1] "3.2"    "100000" "1.1"   

#[[2]]
#[1] "3.3"    "100000" "1.2"    "3000"  

#[[3]]
#[1] "4000"

#[[4]]
#[1] "3.3"    "100000" "1.2"   

#[[5]]
#[1] "3.3"    "100000" "1.2"   

#[[6]]
#[1] "3.2"    "100000"

If you want the output as one string per element:

sapply(stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])'), paste, collapse = ' ')
#[1] "3.2 100000 1.1"      "3.3 100000 1.2 3000" "4000"               
#[4] "3.3 100000 1.2"      "3.3 100000 1.2"      "3.2 100000"  
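One caveat worth checking on real data: the lookahead only rejects a match whose last digit is directly followed by an uppercase letter, so the regex engine can backtrack and return part of a multi-digit ordinal. The sketch below uses base R with perl = TRUE instead of stringr so that it runs without extra packages. The input y (including "10TH") is hypothetical, and the stricter pattern is just one possible variant, not part of the original answer.

```r
# Hypothetical input: the second string has a two-digit ordinal
y <- c("3.3% 1ST $100000", "2.5% 10TH $500")

# Pattern from the answer: "10" is rejected because "T" follows, but the
# engine backtracks to "1", which is followed by "0", so "1" is returned
lookahead <- "\\d+(\\.\\d+)?(?![A-Z])"
res <- regmatches(y, gregexpr(lookahead, y, perl = TRUE))
print(res[[2]])
# [1] "2.5" "1"   "500"

# Possible variant: also forbid a digit right after the match, so a
# shortened integer match is rejected too
stricter <- "\\d+(\\.\\d+)?(?![A-Z0-9])"
res2 <- regmatches(y, gregexpr(stricter, y, perl = TRUE))
print(res2[[2]])
# [1] "2.5" "500"
```

For the data in the question both patterns give the same result, since the only ordinal is the single-digit "1ST".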

Upvotes: 1

Johan Rosa

Reputation: 3152

akrun's answer is perfect, but just to add another solution, here is one using rebus, a package for building regular expression patterns that I recently found.

library(stringr)
library(rebus)
library(magrittr)

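# Digits, optionally followed by a dot and more digits
pattern <- one_or_more(DIGIT) %R% optional(DOT) %R% optional(one_or_more(DIGIT))

# Drop the ordinal "1ST" first so that its 1 is not picked up as a number
str_remove(x, "1ST") %>%
  str_match_all(pattern = pattern) %>%
  lapply(function(x) paste(as.vector(x), collapse = " ")) %>%
  unlist()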

Upvotes: 3

akrun

Reputation: 887901

We also need to include the . in the matching pattern, and wrap it in word boundaries (\\b) so that the 1 in "1ST" is not matched:

sapply(regmatches(x, gregexpr("\\b[[:digit:].]+\\b", x)), paste, collapse= ' ')
#[1] "3.2 100000 1.1"    
#[2] "3.3 100000 1.2 3000" 
#[3] "4000"              
#[4] "3.3 100000 1.2"   
#[5] "3.3 100000 1.2"     
#[6] "3.2 100000"   
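Since the question asks for the numbers as a vector, the same matches can be converted with as.numeric instead of being pasted together. This is a minimal sketch on a subset of the example strings, assuming every match is a well-formed number. A match such as "1.2.3" would satisfy the pattern and become NA with a warning.

```r
# Subset of the example strings from the question
x <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "$4000", "3.2 - $100000")

# Same pattern as above; each list element becomes a numeric vector
matches <- regmatches(x, gregexpr("\\b[[:digit:].]+\\b", x))
nums <- lapply(matches, as.numeric)

# One numeric vector per input string
str(nums)
```

Each element of nums can then be used directly in arithmetic, for example nums[[1]][2] * nums[[1]][1] / 100.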

Upvotes: 3
