BHD

Reputation: 21

Extracting numbers and text from string in R

I have a string and would like to extract the first three numbers (or number ranges), each together with the unit of up to three letters that follows it, and put the results into a vector. So this:

t1 <- "The string contains numbers ranging from 3-4 cm and can reach up to 5.6 m long, and sometimes can even reach 10 m."

t1 would become:

"3-4 cm", "5.6 m", "10 m"

I have looked up various regular expression functions such as grep and grepl, but can't find an example that matches my query. Any suggestions?

Upvotes: 0

Views: 420

Answers (2)

bgoldst

Reputation: 35314

Here's how this can be done with gregexpr()+regmatches():

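ipartRE <- '\\d+'; # integer part
fpartRE <- '\\.\\d+'; # fractional part
numRE <- paste0(ipartRE,'(?:',fpartRE,')?'); # number with an optional fraction
rangeRE <- paste0(numRE,'(?:\\s*-\\s*',numRE,')?'); # number, optionally followed by "-" and a second number
pat <- paste0(rangeRE,'\\s*[a-zA-Z]{1,3}\\b'); # range plus a unit of 1 to 3 letters ending at a word boundary
regmatches(t1,gregexpr(pat,t1,perl=TRUE))[[1L]];
## [1] "3-4 cm" "5.6 m"  "10 m"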

I built up the regex incrementally from component parts for human readability, but obviously you don't need to do that.
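For comparison, here is a sketch of the same pattern written as a single literal. It is meant to be equivalent to the incrementally built `pat` above; the variable names `pat_single` and `res_single` are only illustrative.

```r
t1 <- "The string contains numbers ranging from 3-4 cm and can reach up to 5.6 m long, and sometimes can even reach 10 m."

# number, optional "- number" range, optional whitespace, 1 to 3 letters, word boundary
pat_single <- '\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?\\s*[a-zA-Z]{1,3}\\b'

res_single <- regmatches(t1, gregexpr(pat_single, t1, perl = TRUE))[[1L]]
print(res_single)
## [1] "3-4 cm" "5.6 m"  "10 m"
```

The trailing `\\b` is what stops the unit from matching only the first letters of a longer word.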


To match the new pattern, the second number needs an alternation that also accepts matching parentheses around it. I also found that the dash in 120(–150) cm is not a normal ASCII hyphen but an en dash, so I added another precomputed regular expression piece, dashRE, which matches all three common dash types (ASCII hyphen, en dash, and em dash):

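ipartRE <- '\\d+'; # integer part
fpartRE <- '\\.\\d+'; # fractional part
numRE <- paste0(ipartRE,'(?:',fpartRE,')?'); # number with an optional fraction
dashRE <- '[—–-]'; # em dash, en dash, or ASCII hyphen
# number, optionally followed by either "dash number" or "(dash number)"
rangeOptParenRE <- paste0(numRE,'(?:\\s*(?:',dashRE,'\\s*',numRE,'|\\(\\s*',dashRE,'\\s*',numRE,'\\s*\\)\\s*))?');
pat <- paste0(rangeOptParenRE,'\\s*[a-zA-Z]{1,3}\\b'); # plus a unit of 1 to 3 letters
regmatches(t1,gregexpr(pat,t1,perl=TRUE))[[1L]];
## [1] "3-4 cm"       "120(–150) cm" "5.6 m"        "10 m"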
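As a self-contained sketch of the dash handling, the snippet below writes the dashes as `\u` escapes so the source file's encoding does not matter. The input string `x` is a made-up sample, not the asker's data, and the snippet assumes R is running in a UTF-8 capable session.

```r
# Hypothetical sample input; "\u2013" is an en dash.
x <- "Stems 120(\u2013150) cm tall, leaves 3-4 cm, rarely 5.6 m."

numRE  <- "\\d+(?:\\.\\d+)?"
dashRE <- "[\u2014\u2013-]"  # em dash, en dash, ASCII hyphen (hyphen last, so it is literal)

# number, optionally followed by "dash number" or "(dash number)"
rangeOptParenRE <- paste0(
  numRE, "(?:\\s*(?:", dashRE, "\\s*", numRE,
  "|\\(\\s*", dashRE, "\\s*", numRE, "\\s*\\)\\s*))?"
)
pat_dash <- paste0(rangeOptParenRE, "\\s*[a-zA-Z]{1,3}\\b")

res_dash <- regmatches(x, gregexpr(pat_dash, x, perl = TRUE))[[1L]]
print(res_dash)
```

With this sample, the parenthesized en-dash range, the plain hyphen range, and the decimal value should each come back as one element.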

Upvotes: 1

akuiper

Reputation: 214957

You can try the regular expression [0-9.-]+\\s+[a-zA-Z]{1,3} and use str_extract_all() from the stringr package to extract the matches:

stringr::str_extract_all(t1, "[0-9.-]+\\s+[a-zA-Z]{1,3}")
[[1]]
[1] "3-4 cm" "5.6 m"  "10 m"
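One caveat worth checking against your own data: this pattern has no word boundary after the letters, and its character class accepts a bare "-" or "." as a "number". The sketch below uses base R with a made-up input `x_demo` to show the difference; the stricter pattern is only one possible tightening, not part of the answer above.

```r
# Hypothetical input with a long unit word and a free-standing hyphen.
x_demo <- "It grows 5 meters tall - and 10 m wide."

loose  <- "[0-9.-]+\\s+[a-zA-Z]{1,3}"
strict <- "[0-9]+(?:\\.[0-9]+)?(?:-[0-9]+(?:\\.[0-9]+)?)?\\s+[a-zA-Z]{1,3}\\b"

res_loose  <- regmatches(x_demo, gregexpr(loose,  x_demo, perl = TRUE))[[1L]]
res_strict <- regmatches(x_demo, gregexpr(strict, x_demo, perl = TRUE))[[1L]]

print(res_loose)
## [1] "5 met" "- and" "10 m"
print(res_strict)
## [1] "10 m"
```

For the string in the question both patterns give the same three matches, so the looser form is fine when the input is known to be clean.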

Upvotes: 1
