Suhail Gupta

Reputation: 23206

Extract the substring matching a regex

I am trying to extract 22 chocolates from the following string:

   SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila.

using the regex \\d+\\s*(chocolates.|chocolate.). I used:

grep("\\d+\\s*(chocolates.|chocolate.)",s)

but it does not return the string 22 chocolates. How can I extract the part that matches the regex?

Upvotes: 2

Views: 84

Answers (2)

Wiktor Stribiżew

Reputation: 626699

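Your call does not return 22 chocolates because grep is not an extraction function: it returns the indices of the items in a character vector that contain a match anywhere inside (or, with value = TRUE, those whole items), never the matched substring itself. To get the match, use the pattern with a matching function instead.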
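As a small sketch of that difference, using the string and pattern from the question (the variable names idx and whole are only illustrative):

```r
s <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
p <- "\\d+\\s*(chocolates.|chocolate.)"

# Default call: the index of each element of s that contains a match
idx <- grep(p, s)

# value = TRUE: the whole matching element, still not the matched substring
whole <- grep(p, s, value = TRUE)
```

Here idx is 1 and whole is the entire input string, which is why grep alone cannot produce 22 chocolates.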

Also, note that the (chocolates.|chocolate.) alternation group can be shortened to chocolates?. since the only difference between the alternatives is the plural s, and the ? quantifier (1 or 0 occurrences) makes it optional. The examples below also drop the trailing ., which would otherwise match one extra character after the word.

One such matching function is stringr::str_extract (use str_extract_all to match all occurrences):

> library(stringr)
> x <- " SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
> p <- "\\d+\\s*chocolates?"
> str_extract(x, p)
[1] "22 chocolates"

Or use the base R regmatches/regexpr approach (with gregexpr instead of regexpr to extract multiple occurrences):

> x <- " SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
> p <- "\\d+\\s*chocolates?"
> regmatches(x, regexpr(p, x))
[1] "22 chocolates"
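For the multiple-occurrence case, a minimal base R sketch with gregexpr might look like this. The input y is a made-up string, not the one from the question:

```r
# Hypothetical input with several matches
y <- "2 chocolates now, 22 chocolates later, 1 chocolate left"
p <- "\\d+\\s*chocolates?"

# gregexpr finds every match; regmatches returns a list with one
# character vector per input element, so take the first element
all_matches <- regmatches(y, gregexpr(p, y))[[1]]
```

all_matches should then hold the three substrings "2 chocolates", "22 chocolates" and "1 chocolate".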

Upvotes: 0

Tim Biegeleisen

Reputation: 520928

Here is an option using sub from base R:

x <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
sub(".*?(\\d+ chocolates?).*", "\\1", x)

[1] "22 chocolates"

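The pattern in parentheses, (\\d+ chocolates?), is a capture group, and its match is available as \\1 in the replacement string. Because the surrounding .*? and .* consume the rest of the input, sub replaces the entire string with just the captured text.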


Edit:

As you have seen, if sub cannot find a match, it returns the input string unchanged. This behavior usually makes sense: when there is nothing to substitute, you want the input left as it is.

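If you need to find out whether or not the pattern matches, then calling grep is one option. With value = FALSE (the default) it returns the indices of the matching elements, and an empty integer vector when nothing matches: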

grep(".*(\\d+ chocolates?).*",x,value = FALSE)
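As a sketch of how that check could be combined with extraction, the snippet below uses grepl, which returns TRUE or FALSE directly, and a small helper that returns NA when there is no match. The helper name extract_first and the input z are hypothetical, introduced only for illustration:

```r
x <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
z <- "no sweets here"   # hypothetical input that does not match
p <- "\\d+ chocolates?"

# grepl gives a logical answer per element
has_x <- grepl(p, x)
has_z <- grepl(p, z)

# Hypothetical helper: first match, or NA if the pattern is absent
extract_first <- function(s, p) {
  m <- regexpr(p, s)                      # -1 signals "no match"
  if (m == -1) NA_character_ else regmatches(s, m)
}

res_x <- extract_first(x, p)
res_z <- extract_first(z, p)

# For comparison, sub leaves a non-matching input untouched
sub_z <- sub(".*?(\\d+ chocolates?).*", "\\1", z)
```

Returning NA makes a failed match distinguishable from a real result, which the sub approach cannot do on its own because it hands back the original string.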

Upvotes: 4
