Reputation: 21
I have a string and would like to extract the first three sets of numbers, along with up to three letters next to each number, and put them into a vector. So this:
t1 <- "The string contains numbers ranging from 3-4 cm and can reach up to 5.6 m long, and sometimes can even reach 10 m."
t1 would become:
"3-4 cm", "5.6 m", "10 m"
I have looked up various regular expression functions like grep, grepl, etc., but can't find an example that matches my query. Any suggestions?
Upvotes: 0
Views: 420
Reputation: 35314
Here's how this can be done with gregexpr() + regmatches():
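ipartRE <- '\\d+';                                    # integer part
fpartRE <- '\\.\\d+';                                 # fractional part
numRE <- paste0(ipartRE,'(?:',fpartRE,')?');          # a number such as 10 or 5.6
rangeRE <- paste0(numRE,'(?:\\s*-\\s*',numRE,')?');   # a number or a range such as 3-4
pat <- paste0(rangeRE,'\\s*[a-zA-Z]{1,3}\\b');        # followed by a unit of 1 to 3 letters
regmatches(t1,gregexpr(pat,t1,perl=TRUE))[[1L]];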
## [1] "3-4 cm" "5.6 m" "10 m"
I built up the regex incrementally from component parts for human readability, but obviously you don't need to do that.
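For comparison, here is a minimal sketch of the same pattern written as one literal string instead of being assembled with paste0(). The names pat_flat and matches are only illustrative, and t1 is repeated from the question so the snippet runs on its own.

```r
# Input string from the question
t1 <- "The string contains numbers ranging from 3-4 cm and can reach up to 5.6 m long, and sometimes can even reach 10 m."

# Same regex as above, flattened into one string:
#   number, optional "-number" range, optional spaces, 1 to 3 letters, word boundary
pat_flat <- "\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?\\s*[a-zA-Z]{1,3}\\b"

# gregexpr() finds all match positions; regmatches() extracts the substrings
matches <- regmatches(t1, gregexpr(pat_flat, t1, perl = TRUE))[[1L]]
print(matches)
```

The flattened form is shorter but harder to read, which is the trade-off against the incremental build-up.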
To match the new pattern, the second number needs an alternation that also accepts matching parentheses around it. I also found that the dash in 120(–150) cm is not a normal ASCII hyphen but an en dash, so I added another precomputed regular expression piece called dashRE, which matches all 3 common dash types (ASCII hyphen, en dash, and em dash):
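ipartRE <- '\\d+';                                    # integer part
fpartRE <- '\\.\\d+';                                 # fractional part
numRE <- paste0(ipartRE,'(?:',fpartRE,')?');          # a number such as 10 or 5.6
dashRE <- '[—–-]';                                    # em dash, en dash, or ASCII hyphen
# second number is either "dash number" or "(dash number)"
rangeOptParenRE <- paste0(numRE,'(?:\\s*(?:',dashRE,'\\s*',numRE,'|\\(\\s*',dashRE,'\\s*',numRE,'\\s*\\)\\s*))?');
pat <- paste0(rangeOptParenRE,'\\s*[a-zA-Z]{1,3}\\b');
regmatches(t1,gregexpr(pat,t1,perl=TRUE))[[1L]];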
## [1] "3-4 cm" "120(–150) cm" "5.6 m" "10 m"
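The updated input string is not shown here, so the following is only a sketch on a made-up string, t2, which is an assumption and not the asker's data. It writes the dashes as \u escapes so the code does not depend on the file's encoding, and it assumes a UTF-8 session.

```r
# Hypothetical input (not from the original post); \u2013 is an en dash
t2 <- "Leaves are 3-4 cm wide and stems are 120(\u2013150) cm tall."

numRE  <- "\\d+(?:\\.\\d+)?"     # a number such as 10 or 5.6
dashRE <- "[\u2014\u2013-]"      # em dash, en dash, or ASCII hyphen

# Second number is either "dash number" or "(dash number)"
rangeOptParenRE <- paste0(
  numRE, "(?:\\s*(?:", dashRE, "\\s*", numRE,
  "|\\(\\s*", dashRE, "\\s*", numRE, "\\s*\\)\\s*))?"
)
pat2 <- paste0(rangeOptParenRE, "\\s*[a-zA-Z]{1,3}\\b")

res <- regmatches(t2, gregexpr(pat2, t2, perl = TRUE))[[1L]]
print(res)
```

On this string the pattern should pick up both the plain range and the parenthesized one.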
Upvotes: 1
Reputation: 214957
You can try the regular expression [0-9.-]+\\s+[a-zA-Z]{1,3} and use str_extract_all from the stringr package to extract the matches:
stringr::str_extract_all(t1, "[0-9.-]+\\s+[a-zA-Z]{1,3}")
[[1]]
[1] "3-4 cm" "5.6 m" "10 m"
Upvotes: 1