Stefania Axo

Reputation: 127

R substring based on Regular Expression

I have strings like:

myString = "2 word1 & 4 word2"
myString = "4 word2"
myString = "2 word1"

I would like to get the number before word1 and the number before word2:

number1 = 2
number2 = 4

How can I do this with a regular expression in R?

I tried something like this, but it only gets the first number:

gsub("([0-9]+).*", "\\1", myString)

Upvotes: 2

Views: 79

Answers (3)

G. Grothendieck

Reputation: 269451

This removes each occurrence of a letter or ampersand possibly followed by other non-space characters and then scans in what is left. The scan also converts them to numeric. No packages are used.

myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

lapply(myString, function(x) scan(text = gsub("[[:alpha:]&]\\S*", "", x), quiet = TRUE))

giving:

[[1]]
[1] 2 4

[[2]]
[1] 4

[[3]]
[1] 2
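To see why this works, the two steps can be run separately. This is a minimal sketch on the first string only; the names cleaned and nums are just for illustration:

```r
x <- "2 word1 & 4 word2"

# Step 1: delete every token that starts with a letter or "&".
# Only the digits and some whitespace are left.
cleaned <- gsub("[[:alpha:]&]\\S*", "", x)

# Step 2: scan() splits the remainder on whitespace and returns numerics.
nums <- scan(text = cleaned, quiet = TRUE)
print(nums)
```

Because scan returns a numeric vector, no separate as.numeric call is needed.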

Upvotes: 0

Wiktor Stribiżew

Reputation: 626690

You may extract a specific number before a specific string using a regex with a lookahead (myString is the character vector of all three strings):

> library(stringr)
> word1_res <- str_extract_all(myString, "\\d+(?=\\s*word1)")
> word1_res
[[1]]
[1] "2"

[[2]]
character(0)

[[3]]
[1] "2"

The results for word2 can be retrieved similarly:

word2_res <- str_extract_all(myString, "\\d+(?=\\s*word2)")

Details

  • \d+ - 1 or more digits
  • (?=\s*word2) - a positive lookahead that requires the digits to be immediately followed by:
    • \s* - 0+ whitespaces
    • word2 - a literal word2 substring.

A base R equivalent is

regmatches(myString, gregexpr("\\d+(?=\\s*word1)", myString, perl=TRUE))
regmatches(myString, gregexpr("\\d+(?=\\s*word2)", myString, perl=TRUE))
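If a single number per string is enough, the lookahead can be wrapped in a small function that returns a numeric vector with NA where the word is absent. This is only a sketch: number_before is a hypothetical helper, not part of base R or stringr, and it assumes word contains no regex metacharacters.

```r
# Hypothetical helper: the number directly before `word` in each string,
# or NA when `word` does not occur. Assumes `word` has no regex metacharacters.
number_before <- function(x, word) {
  pattern <- paste0("\\d+(?=\\s*", word, ")")
  pos <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- rep(NA_real_, length(x))
  out[pos > 0] <- as.numeric(regmatches(x, pos))
  out
}

myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
number1 <- number_before(myString, "word1")   # 2 NA 2
number2 <- number_before(myString, "word2")   # 4 4 NA
```

Using regexpr rather than gregexpr keeps only the first match per string, which is why the result fits in a plain vector.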

An almost equivalent solution using sub would be

> sub(".*?(\\d+)\\s*word1.*|.*","\\1",myString)
[1] "2" ""  "2"
> sub(".*?(\\d+)\\s*word2.*|.*","\\1",myString)
[1] "4" "4" "" 

Note that this implies there is only one result per string, while str_extract_all will get all occurrences from the string.
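The difference shows up as soon as a string contains the word twice. A small sketch with a made-up input (not from the question); perl = TRUE is added to the sub call here so that both calls use the same regex engine:

```r
s <- "2 word1 & 7 word1"   # made-up string with two numbers before word1

# sub replaces the whole string with the first captured number only
first_only <- sub(".*?(\\d+)\\s*word1.*|.*", "\\1", s, perl = TRUE)

# gregexpr/regmatches returns every number followed by word1
all_hits <- regmatches(s, gregexpr("\\d+(?=\\s*word1)", s, perl = TRUE))[[1]]

print(first_only)   # "2"
print(all_hits)     # "2" "7"
```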

To extract any chunk of 1+ digits as a whole word, use a stringr solution with str_extract_all:

library(stringr)
str_extract_all(myString, "\\b\\d+\\b")

or a base R one with regmatches/gregexpr:

myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")
regmatches(myString, gregexpr("\\b\\d+\\b", myString))

Output:

[[1]]
[1] "2" "4"

[[2]]
[1] "4"

[[3]]
[1] "2"

Details

  • \b - a word boundary
  • \d+ - 1 or more digits
  • \b - a word boundary.
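Both variants return character vectors. If actual numbers are needed, as in the question, one possible follow-up step is to convert each list element with as.numeric. A minimal sketch; digits and numbers are illustrative names, and perl = TRUE is added here only to pin down the regex engine:

```r
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# Extract the digit chunks as strings, one list element per input string
digits <- regmatches(myString, gregexpr("\\b\\d+\\b", myString, perl = TRUE))

# Convert each character vector to numeric
numbers <- lapply(digits, as.numeric)
print(numbers)
```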

Upvotes: 3

Gerard H. Pille

Reputation: 2578

Try:

myString = "2 word1 & 4 word2"
number1 = gsub("([0-9]+).*", "\\1", myString)
myString = "4 word2"
number2 = gsub("([0-9]+).*", "\\1", myString)
myString = "2 word1"
number3 = gsub("([0-9]+).*", "\\1", myString)
print(number1)
print(number2)
print(number3)

If you assign a string to myString three times, myString will only contain the last one.
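One way around the repeated assignment is to keep all three strings in a single character vector, since sub and gsub are vectorized over their input. A sketch, with first_numbers as an illustrative name; note that this pattern still returns only the first number of each string:

```r
# Store all strings in one vector instead of overwriting myString
myString <- c("2 word1 & 4 word2", "4 word2", "2 word1")

# sub processes every element; each result is the leading number as a string
first_numbers <- sub("([0-9]+).*", "\\1", myString)
print(first_numbers)   # "2" "4" "2"
```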

Upvotes: 1
