learnr

Reputation: 6649

More than 9 backreferences in gsub()

How to use gsub with more than 9 backreferences? I would expect the output in the example below to be "e, g, i, j, o".

> test <- "abcdefghijklmnop"
> gsub("(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)(\\w)", "\\5, \\7, \\9, \\10, \\15", test, perl = TRUE)
[1] "e, g, i, a0, a5"

Upvotes: 12

Views: 4053

Answers (6)

mjv

Reputation: 75205

This limitation to 9 backreferences is specific to the sub() and gsub() functions, not to grep() and similar functions. A pattern with more than 9 groups can be matched with PCRE regular expressions (i.e. the perl = TRUE argument); however, even with this option, the replacement string in sub() and gsub() can only refer to the first 9 groups.

The R documentation is explicit on this point: see ?regexp

There can be more than 9 backreferences (but the replacement in sub can
only refer to the first 9).

Furthermore, the idea of using named capture groups to circumvent this limitation is bound to fail, since named backreferences are not supported by sub() and gsub().

regexpr and gregexpr support ‘named capture’. If groups are named,
e.g., "(?<first>[A-Z][a-z]+)" then the positions of the matches are also
returned by name. (Named backreferences are not supported by sub.)
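
One base-R way around the limit, offered here as a sketch rather than something the quoted documentation prescribes, is to skip the replacement string altogether: extract all groups with regexec() and regmatches(), then assemble the output by hand. The variable names are illustrative, and the approach assumes the pattern matches the whole string, as it does in the question.

```r
test <- "abcdefghijklmnop"

# Build the 16-group pattern from the question without typing it out.
pattern <- paste(rep("(\\w)", 16), collapse = "")

# regmatches() on a regexec() result returns the full match first,
# followed by one element per capture group.
m <- regmatches(test, regexec(pattern, test))[[1]]
groups <- m[-1]  # drop the full match, keep groups 1..16

# Pick any groups by number, including those above 9.
result <- paste(groups[c(5, 7, 9, 10, 15)], collapse = ", ")
result
## [1] "e, g, i, j, o"
```

Because the groups are indexed as an ordinary character vector, there is no ambiguity between group 10 and group 1 followed by a 0.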

Upvotes: 0

gagolews

Reputation: 13056

The stri_replace_*_regex functions from the stringi package do not have such limitations:

library("stringi")
stri_replace_all_regex("abcdefghijkl", "(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)", "$10$1$11$12")
## [1] "jakl"

If you want to follow the 1st capture group with 1, use e.g.

stri_replace_all_regex("abcdefghijkl", "(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)", "$10$1$1\\1$12")
## [1] "jaa1l"

Upvotes: 2

hadley

Reputation: 103908

Use strsplit instead:

test <- "abcdefghijklmnop"
strsplit(test, "")[[1]][c(5, 7, 9, 10, 15)]
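
To get the exact comma-separated string the question asks for, the selected characters can be joined with paste(). This is a small sketch built on the strsplit() call above; pick_chars is a hypothetical helper name, not part of base R.

```r
# Hypothetical helper: return the characters of x at the given positions,
# joined with sep. Assumes x is a single string.
pick_chars <- function(x, positions, sep = ", ") {
  chars <- strsplit(x, "")[[1]]  # split into single characters
  paste(chars[positions], collapse = sep)
}

test <- "abcdefghijklmnop"
picked <- pick_chars(test, c(5, 7, 9, 10, 15))
picked
## [1] "e, g, i, j, o"
```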

Upvotes: 4

Gumbo

Reputation: 655319

See Regular Expressions with The R Language:

You can use the backreferences \1 through \9 in the replacement text to reinsert text matched by a capturing group. There is no replacement text token for the overall match. Place the entire regex in a capturing group and then use \1.

But with PCRE you should be able to use named groups. So try (?P<name>regex) for group naming and (?P=name) as the backreference.

Upvotes: 10

Rich Seller

Reputation: 84038

According to this site, backreferences \10 to \99 work in some languages, but not most.

Those that are reported to work are

Upvotes: 1

easement

Reputation: 6139

It was my understanding that \10 would be understood as backreference 1 followed by a literal digit 0, which matches the "a0" in the question's output. I think 9 is the max.
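
A quick way to check how the replacement string is parsed is to use a pattern with only one capture group, so that a group 10 cannot exist. This is a minimal sketch that mirrors the perl = TRUE call in the question.

```r
# Only one group is defined, so "\\10" cannot refer to a tenth group.
out <- sub("(\\w)", "\\10", "a", perl = TRUE)
out
## [1] "a0"
```

The result is group 1 ("a") followed by a literal "0", consistent with the "a0" and "a5" in the question's output.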

Upvotes: 3
