Patrick McCarthy

Reputation: 2538

Regex for rectangle brackets in R

Conventionally in R one can match a metacharacter literally in a regex by escaping it with two backslashes, e.g. ( becomes \\(, but I find the same isn't true for square brackets inside a character set.

mystring <- "abc[de"

#remove [,] and $ characters

gsub("[\\[\\]$]","",mystring)

[1] "abc[de"

[[:punct:]] works but I hate to use a non-standard regex if I don't have to. Can the regex set syntax be used?

Upvotes: 6

Views: 578

Answers (3)

Wiktor Stribiżew

Reputation: 627083

You should enable perl = TRUE, then you can use Perl-like syntax, which is more straightforward (IMHO):

gsub("[\\[\\]$]","",mystring, perl = TRUE)

Or, you may use "smart placement", putting ] at the start of the bracket expression ([ is not special inside it, so there is no need to escape [ there):

gsub("[][$]","",mystring)


Result:

[1] "abcde"
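As a quick check of the two options above, here is a minimal sketch that runs both on a string containing all three target characters. The test string x and the result variable names are made up for illustration and are not from the question.

```r
# Hypothetical test string with all three characters to remove: [ ] $
x <- "a[b]c$d"

# PCRE (perl = TRUE): backslash escapes work inside a character class
pcre_result <- gsub("[\\[\\]$]", "", x, perl = TRUE)

# TRE (default engine): put ] first so it is literal; [ and $ need no escaping
tre_result <- gsub("[][$]", "", x)

print(pcre_result)  # [1] "abcd"
print(tre_result)   # [1] "abcd"
```

Both calls should agree, so the choice is mostly about which syntax you find easier to read.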

More details

The [...] construct is treated as a bracket expression, a POSIX regex construct, by the TRE regex engine (the default in base R regex functions such as (g)sub, grep(l) and (g)regexpr when used without perl=TRUE). Bracket expressions, unlike character classes in NFA regex engines, do not support escape sequences, i.e. the \ char is treated as a literal backslash char inside them.

Thus, [\[\]] in a TRE regex matches a \ or [ char (the [\[\] part is a complete bracket expression, equivalent to [\[]) followed by a literal ]. So, it matches the \] or [] substrings. Just have a look at gsub("[\\[\\]]", "", "[]\\]ab]"): it outputs ab] because [] and \] are matched and removed.
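To make the difference concrete, the same pattern can be run through both engines. This is only a sketch; the variable names s, tre_out and pcre_out are mine, and the TRE result is the one described above.

```r
# The input consists of the characters [ ] \ ] a b ]
s <- "[]\\]ab]"

# TRE (default): "[\\[\\]" is the set {\, [}, then a literal ] must follow,
# so only the two-character substrings "[]" and "\]" are removed
tre_out <- gsub("[\\[\\]]", "", s)

# PCRE: the backslashes escape [ and ], so the class is {[, ]}
# and every single bracket is removed, leaving the backslash behind
pcre_out <- gsub("[\\[\\]]", "", s, perl = TRUE)

print(tre_out)   # [1] "ab]"
print(pcre_out)  # [1] "\\ab"
```

Note that print() shows the single remaining backslash in pcre_out in its escaped form.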

Note that the terms POSIX bracket expressions and NFA character classes are used here in the same sense as at https://www.regular-expressions.info. This is not quite standard terminology, but there is a need to differentiate between the two.

Upvotes: 4

Greg Snow

Reputation: 49650

You can just use \\[ as the thing to match. You don't need additional square brackets unless you are matching multiple options:

> mystring <- 'abc[de'
> gsub("\\[", "", mystring)
[1] "abcde"

You can make this even simpler and faster for single characters by taking away the special meaning using fixed=TRUE:

> mystring <- 'abc[de'
> gsub("[", "", mystring, fixed=TRUE)
[1] "abcde"

Or, if ] is the first thing inside the square brackets (unescaped), it is taken as the literal character rather than having its usual special meaning, and [ is literal anywhere inside them:

> mystring <- 'a,bc[d]e$'
> gsub("[][,$]", "", mystring)
[1] "abcde"

Upvotes: 1

Frank

Reputation: 66819

I would sidestep [ab] syntax and use (a|b). Besides working, it may also be more readable:

gsub("(\\[|\\]|\\$)","",mystring)
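A small sketch to confirm that the alternation behaves like the bracket-expression form with ] placed first. The sample string x and the variable names are invented for illustration.

```r
# Hypothetical input containing [, ] and $ plus a comma that should survive
x <- "a,bc[d]e$"

# Alternation: each metacharacter is escaped outside of any brackets
alt_result <- gsub("(\\[|\\]|\\$)", "", x)

# Bracket expression with ] placed first, for comparison
set_result <- gsub("[][$]", "", x)

print(alt_result)                         # [1] "a,bcde"
print(identical(alt_result, set_result))  # [1] TRUE
```

The parentheses are only grouping here and could be dropped, since the whole pattern is one alternation.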

Upvotes: 4
