Regular expressions in R that apply only from nth word on

Question

In R regular expressions, how could one have a regex evaluated not from the beginning of the target strings, but instead only from the nth word on?

For example, suppose one is interested in substituting any numbers in a string with the symbol @. Then one could use gsub("\\d+", "@", string), as in:

gsub("\\d+", "@", "words before 879 then more words then 1001 again")

The result would be:

> "words before @ then more words then @ again"

Now, continuing with that example, how could one use regex so that only numbers appearing from the 4th word onward get substituted? The above example would then return "words before 879 then more words then @ again", because 879 is the 3rd word in the target string.

FWIW, I found many questions on extracting and locating words, some on matching from the beginning versus from the end, some on getting the substring up to or from the nth word. But none on how exactly to only have regular expressions disregard the first n words of a string when looking for a pattern.

Ryszard Czech · Accepted Answer

Use this pattern with perl=TRUE (in an R string literal, every backslash must be doubled):

^\s*(?:\S+\s*){3}(*SKIP)(*FAIL)|\d+


EXPLANATION

--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  ){3}                     end of grouping
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           discard what has been matched so far and
                           resume the search from that position
--------------------------------------------------------------------------------
  |                        OR
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
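
To check which numbers the pattern actually leaves available for replacement, the matches can be listed with base R's gregexpr() and regmatches(). This is only a small sketch; the variable names are illustrative and not part of the original answer.

```r
x <- "words before 879 then more words then 1001 again"

# Plain digit pattern: finds every number in the string
all_numbers <- regmatches(x, gregexpr("\\d+", x, perl = TRUE))[[1]]

# Pattern from the answer: the first three words are consumed and discarded,
# so only numbers from the 4th word onward remain as matches
skip_pattern <- "^\\s*(?:\\S+\\s*){3}(*SKIP)(*FAIL)|\\d+"
kept_numbers <- regmatches(x, gregexpr(skip_pattern, x, perl = TRUE))[[1]]

all_numbers
# [1] "879"  "1001"
kept_numbers
# [1] "1001"
```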

Example code sample:

gsub("^\\s*(?:\\S+\\s*){3}(*SKIP)(*FAIL)|\\d+", "@", "words before 879 then more words then 1001 again", perl=TRUE)
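
If the number of words to skip varies, the same pattern can be built from a parameter. The helper below is a hypothetical sketch: the name replace_after_n_words and its arguments are made up here and are not part of the answer. It assumes n is at least 1 and that pattern is a valid PCRE expression.

```r
# Hypothetical helper, for illustration only.
replace_after_n_words <- function(x, pattern, replacement, n) {
  stopifnot(n >= 1)
  # Prefix that consumes the first n words, then discards them via (*SKIP)(*FAIL)
  skip_prefix <- sprintf("^\\s*(?:\\S+\\s*){%d}(*SKIP)(*FAIL)|", as.integer(n))
  # Wrap the user pattern in a non-capturing group so its own alternations stay intact
  full_pattern <- paste0(skip_prefix, "(?:", pattern, ")")
  gsub(full_pattern, replacement, x, perl = TRUE)
}

replace_after_n_words("words before 879 then more words then 1001 again", "\\d+", "@", 3)
# [1] "words before 879 then more words then @ again"
```

The perl = TRUE argument is required, because the (*SKIP) and (*FAIL) verbs are PCRE features.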
