Michael

Reputation: 93

How to extract a substring by inverse pattern with R?

I am trying to extract a substring by pattern using R's gsub() function.

# Example: extracting "7 years" substring.
string <- "Psychologist - 7 years on the website, online"
gsub(pattern="[0-9]+\\s+\\w+", replacement="", string)

[1] "Psychologist -  on the website, online"

As you can see, it's easy to exclude the matched substring using gsub(), but I need to invert the result and get only "7 years". I thought about using "^", something like this:

gsub(pattern="[^[0-9]+\\s+\\w+]", replacement="", string)

Could anyone please help me with the correct regexp pattern?

Upvotes: 6

Views: 1605

Answers (2)

Jan

Reputation: 43179

You could use the opposite of \d, which is \D in R:

string <- "Psychologist - 7 years on the website, online"
sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", string)
# [1] "7 years"

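\D* matches as many non-digit characters as possible; the digits, whitespace, and word that follow are captured in group 1; .* consumes the rest of the string. The whole match is then replaced with the contents of the group.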

See a demo on regex101.com.
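
One thing worth checking with this approach is what happens when a string contains no match at all. The sketch below applies the same pattern to a small vector; the two extra strings are made up for illustration and are not from the question. Because sub() only replaces what it matches, an element with no digits should come back unchanged rather than as an empty string.

```r
# Hypothetical inputs: only the first string comes from the question.
strings <- c(
  "Psychologist - 7 years on the website, online",
  "Coach - 12 months on the website",
  "No duration here"
)

# Same pattern as above: skip non-digits, capture "<digits> <word>", drop the rest.
result <- sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", strings)

print(result)
```

If unmatched inputs need to be told apart from real extractions, comparing result against the original vector (or testing with grepl() first) is one possible way to do it.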

Upvotes: 4

Wiktor Stribiżew

Reputation: 627219

You may use

sub(pattern=".*?([0-9]+\\s+\\w+).*", replacement="\\1", string)

See this R demo.

Details

  • .*? - any 0+ chars, as few as possible
  • ([0-9]+\\s+\\w+) - Capturing group 1:
    • [0-9]+ - one or more digits
    • \\s+ - 1 or more whitespaces
    • \\w+ - 1 or more word chars
  • .* - the rest of the string (any 0+ chars, as many as possible)

The \1 in the replacement is a backreference to Group 1, so the whole matched string is replaced with just the captured text.
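
As an alternative to replacing everything around the match, base R can also return the match itself. The sketch below is a different technique from the sub() call above, not part of this answer's solution: it uses regexpr() to locate the first match of the capturing group's pattern and regmatches() to pull it out. The second input string is a made-up example.

```r
string <- "Psychologist - 7 years on the website, online"

# regexpr() locates the first match; regmatches() extracts the matched text.
m <- regexpr("[0-9]+\\s+\\w+", string)
extracted <- regmatches(string, m)
print(extracted)

# With no match, regmatches() drops the element and returns an empty vector.
no_match <- "No duration here"  # hypothetical input
none <- regmatches(no_match, regexpr("[0-9]+\\s+\\w+", no_match))
print(length(none))
```

Note the difference in the no-match case: sub() returns the input unchanged, while this approach returns a zero-length character vector.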

Upvotes: 7
