Is there a simple way to get substring in R?

Question

I get the substring of a word in the following way:

 word="xyz9874"
 pattern="[0-9]+"
 x=gregexpr(pattern,word)
 substr(word,start=x[[1]],stop=x[[1]]+attr(x[[1]],"match.length")-1)
[1] "9874"

Is there a simpler way to get the result in R?

January · Accepted Answer

Sure, use gsub and backreferencing:

gsub( ".*?([0-9]+).*", "\\1", word )

Explanation: in most regex implementations, \1 is the back reference to the first subpattern matched. The subpattern is enclosed in parentheses. In R, you need to escape the backslash inside the string (so you type "\\1"), irrespective of the type of quotation marks you are using.
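As a minimal sketch of the escaping rule (the variable name result is only for illustration): the replacement typed as "\\1" in R source is a two-character string, a backslash followed by 1, and that is what the regex engine receives as the back reference.

```r
word <- "xyz9874"

# "\\1" in R source is two characters: a backslash and the digit 1.
n_chars <- nchar("\\1")
print(n_chars)

# The parentheses capture the digits; "\\1" puts the captured text back
# in place of the whole match.
result <- gsub(".*?([0-9]+).*", "\\1", word)
print(result)
```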

The question mark, an idiom of the "extended" regular expressions, means that the preceding .* should not be greedy; in other words, it should take as little of the string as possible. Otherwise, the .* in the pattern .*([0-9]+) would match xyz987 and ([0-9]+) would match only 4. Alternatively, we can write

gsub( ".*[^0-9]+([0-9]+).*", "\\1", word )

but then we have a problem with strings that start with a number: the pattern requires at least one non-digit before the digits, so it does not match them.
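To compare the variants side by side, here is a small sketch. The second input, "9874xyz", is a made-up example of a string that starts with a number, and the variable names are only for illustration.

```r
word <- "xyz9874"

# Greedy: the leading .* keeps as much as it can, leaving one digit for the group.
greedy <- gsub(".*([0-9]+).*", "\\1", word)

# Non-greedy: the leading .*? stops as early as possible.
lazy <- gsub(".*?([0-9]+).*", "\\1", word)

# Alternative: require at least one non-digit before the digits.
alt <- gsub(".*[^0-9]+([0-9]+).*", "\\1", word)

# Hypothetical input starting with a number: the alternative pattern
# cannot match, so gsub returns the string unchanged.
alt_leading <- gsub(".*[^0-9]+([0-9]+).*", "\\1", "9874xyz")

print(c(greedy = greedy, lazy = lazy, alt = alt, alt_leading = alt_leading))
```

The greedy call should give "4", the non-greedy and alternative calls "9874", and the last call returns "9874xyz" untouched because nothing matched.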

By the way, note that instead of [0-9] you can write \d, or, actually, in an R string, \\d:

gsub( ".*?(\\d+).*", "\\1", word )
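A quick check, sketched under the assumption that the default regex engine is used, that the two spellings of the digit class behave the same on the example word (variable names are illustrative):

```r
word <- "xyz9874"

# writeLines shows the string as the regex engine receives it: \d
writeLines("\\d")

with_class <- gsub(".*?([0-9]+).*", "\\1", word)
with_d     <- gsub(".*?(\\d+).*", "\\1", word)

# Both patterns extract the same digits from this input.
same <- identical(with_class, with_d)
print(same)
```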
