Reputation: 33
In R regular expressions, how could one have a regex evaluated not from the beginning of the target strings, but instead only from the nth word on?
For example, suppose one is interested in substituting any numbers in a string with the symbol @. Then one could use gsub("\\d+", "@", string), like in:
gsub("\\d+", "@", "words before 879 then more words then 1001 again")
The result would be:
> "words before @ then more words then @ again"
Now, keeping up with that example, using regex, how could one make it so that only numbers that appear starting from the 4th word in the string would get substituted? That is, the above example would return "words before 879 then more words then @ again", because 879 is the 3rd word in the target string.
FWIW, I found many questions on extracting and locating words, some on matching from the beginning versus from the end, and some on getting the substring up to or from the nth word, but none on how to make a regular expression disregard the first n words of a string when looking for a pattern.
Upvotes: 3
Views: 99
Reputation: 887118
We could pass a proto object to gsubfn to count the words and replace only from the 4th word on (str1 and str2 are defined at the end):
library(gsubfn)
gsubfn("\\w+", proto(fun = function(this, x) if(count > 3)
sub("\\d+", "@", x) else x), str1)
#[1] "words before 879 then more words then @ again"
One advantage is that it can replace at any word count, or at several ranges of word counts. For example, suppose we want to replace only within words 4 to 6:
gsubfn("\\w+", proto(fun = function(this, x) if(count %in% 4:6)
sub("\\d+", "@", x) else x), str1)
or a more complex case
gsubfn("\\w+", proto(fun = function(this, x) if(count %in% c(4:6, 12:15))
sub("\\d+", "@", x) else x), str2)
#[1] "words before 879 then @ replace not 1001 again and replace @ and @"
str1 <- "words before 879 then more words then 1001 again"
str2 <- "words before 879 then 50 replace not 1001 again and replace 1003 and 1005"
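For comparison, the same word-counting idea can be sketched in base R without gsubfn, using gregexpr() and the regmatches<- replacement function. This is only a minimal sketch, assuming that words are runs of \w characters as in the calls above; the helper name replace_from_nth_word and its arguments are hypothetical, chosen for illustration.

```r
# Hypothetical helper (not part of gsubfn or base R): apply sub() only to
# words whose position in the string is n or later.
replace_from_nth_word <- function(x, n, pattern = "\\d+", replacement = "@") {
  m <- gregexpr("\\w+", x)           # locate every word in each string
  words <- regmatches(x, m)          # list: one character vector per string
  words <- lapply(words, function(w) {
    from_n <- seq_along(w) >= n      # TRUE for the nth word onwards
    w[from_n] <- sub(pattern, replacement, w[from_n])
    w
  })
  regmatches(x, m) <- words          # write the edited words back in place
  x
}

str1 <- "words before 879 then more words then 1001 again"
replace_from_nth_word(str1, 4)
# [1] "words before 879 then more words then @ again"
```

Because the test on the word position is ordinary R code, it could be swapped for something like seq_along(w) %in% 4:6 to mirror the range examples above.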
Upvotes: 4
Reputation: 18611
Use this with perl=TRUE (double the backslashes in R):
^\s*(?:\S+\s*){3}(*SKIP)(*FAIL)|\d+
See proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (3 times):
--------------------------------------------------------------------------------
\S+ non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
){3} end of grouping
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) discard the text matched so far and resume the search right after it
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
Example code sample:
gsub("^\\s*(?:\\S+\\s*){3}(*SKIP)(*FAIL)|\\d+", "@", "words before 879 then more words then 1001 again", perl=TRUE)
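The number of words to skip can be made a parameter by building the pattern with sprintf(). The following is a minimal sketch, assuming the same (*SKIP)(*FAIL) construct as above and a positive word count; the helper name skip_first_words and its arguments are hypothetical.

```r
# Hypothetical helper: skip the first n whitespace-separated tokens, then
# replace every match of `pattern` in the rest of the string.
skip_first_words <- function(x, n, pattern = "\\d+", replacement = "@") {
  # Same regex as above, with the repeat count and the target pattern filled in
  rx <- sprintf("^\\s*(?:\\S+\\s*){%d}(*SKIP)(*FAIL)|(?:%s)",
                as.integer(n), pattern)
  gsub(rx, replacement, x, perl = TRUE)  # the verbs require perl = TRUE
}

skip_first_words("words before 879 then more words then 1001 again", 3)
# [1] "words before 879 then more words then @ again"
```

Note that this counts whitespace-separated tokens (\S+), so a string such as "a-b 12" has two tokens here, whereas a count based on \w+ would see three words.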
Upvotes: 4