Reputation: 127
I have a strings like :
myString = "2 word1 & 4 word2"
myString = "4 word2"
myString = "2 word1"
I would like to get the number before the word1 and the number before word2
number1 = 2
number2 = 4
How can i do with a regular expression in R
I tried something like this but it only get the first number
gsub("([0-9]+).*", "\\1", myString)
Upvotes: 2
Views: 79
Reputation: 269451
This removes each occurrence of a letter or ampersand possibly followed by other non-space characters and then scans in what is left. The scan also converts them to numeric. No packages are used.
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
lapply(myString, function(x) scan(text = gsub("[[:alpha:]&]\\S*", "", x), quiet = TRUE))
giving:
[[1]]
[1] 2 4
[[2]]
[1] 4
[[3]]
[1] 2
Upvotes: 0
Reputation: 626690
You may extract specific number before a specific string using a regex with a lookahead:
> word1_res <- str_extract_all(myString, "\\d+(?=\\s*word1)")
> word1_res
[[1]]
[1] "2"
[[2]]
character(0)
[[3]]
[1] "2"
The results for word2
can be retrieved similarly:
word2_res <- str_extract_all(myString, "\\d+(?=\\s*word2)")
Details
\d+
- 1 or more digits...(?=\\s*word2)
- if immediately followed with:
\s*
- 0+ whitespacesword2
- a literal word2
substring.A base R equivalent is
regmatches(myString, gregexpr("\\d+(?=\\s*word1)", myString, perl=TRUE))
regmatches(myString, gregexpr("\\d+(?=\\s*word2)", myString, perl=TRUE))
A sub
almost equivalent solution would be
> sub(".*?(\\d+)\\s*word1.*|.*","\\1",myString)
[1] "2" "" "2"
> sub(".*?(\\d+)\\s*word2.*|.*","\\1",myString)
[1] "4" "4" ""
Note that this implies there is only one result per string, while str_extract_all
will get all occurrences from the string.
To extract any chunk of 1+ digits as a whole word using a stringr
solution with str_extract_all
library(stringr)
str_extract_all(myString, "\\b\\d+\\b")
or a base R one with regmatches
/gregexpr
:
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
regmatches(myString, gregexpr("\\b\\d+\\b", myString))
See an online R demo. Output:
[[1]]
[1] "2" "4"
[[2]]
[1] "4"
[[3]]
[1] "2"
Details
\b
- a word boundary\d+
- 1 or more digits\b
- a word boundary.Upvotes: 3
Reputation: 2578
try
myString = "2 word1 & 4 word2"
number1 = gsub("([0-9]+).*", "\\1", myString)
myString = "4 word2"
number2 = gsub("([0-9]+).*", "\\1", myString)
myString = "2 word1"
number3 = gsub("([0-9]+).*", "\\1", myString)
print(number1)
print(number2)
print(number3)
If you assign 3 times a string to myString, myString will only contain the last one.
Upvotes: 1