soosus

Reputation: 1217

remove all characters between string and bracket in R

Say I have a dataframe df in which a column df$strings contains strings like

[cat 00.04;09]
[cat 00.04;10]

and so on. I want to remove all characters between "[cat" and "]" to yield

[cat]
[cat]

I've tried this using gsub but it's not working and I'm not sure what I'm doing wrong:

gsub('cat*?\\]', '', df)

Upvotes: 1

Views: 480

Answers (1)

Wiktor Stribiżew

Reputation: 626689

Note that the cat*?\\] pattern matches ca, then zero or more t chars (as few as possible), and then ]. It never matches your strings because cat is followed by a space there, not by ].
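A quick check makes this visible. This is only a sketch using the two sample strings from the question; the names x, no_match, direct_match and unchanged are illustrative, not from the original post.

```r
x <- c("[cat 00.04;09]", "[cat 00.04;10]")

# The pattern needs "]" directly after "ca" plus optional "t" chars,
# but in the sample strings "cat" is followed by a space
no_match <- grepl('cat*?\\]', x)

# It would only match when "]" follows immediately, as in "[cat]"
direct_match <- grepl('cat*?\\]', "[cat]")

# With nothing to match, gsub returns the input as it was
unchanged <- gsub('cat*?\\]', '', x)
```

Note also that the call in the question passes the whole data frame df rather than the df$strings column, which is probably not what was intended.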

You want to match any chars other than ] between [cat and ]:

gsub('\\[cat[^]]*\\]', '[cat]', df$strings)

Here,

  • \\[ - matches [
  • cat - matches cat
  • [^]]* - 0+ chars other than ] (note that ] must not be escaped inside the bracket expression when it is placed at the start; if you do escape it, you need to add the perl=TRUE argument, since the PCRE regex engine handles regex escapes inside bracket expressions while the default TRE engine does not)
  • \\] - a ] (you do not even need to escape it, you may just use ]).
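The two spellings mentioned in the list can be compared side by side. This is a minimal sketch on the sample strings; tre_result and pcre_result are illustrative names.

```r
x <- c("[cat 00.04;09]", "[cat 00.04;10]")

# Default TRE engine: "]" comes first in the bracket expression and stays
# unescaped; the closing "]" of the pattern is left unescaped as well
tre_result <- gsub('\\[cat[^]]*]', '[cat]', x)

# PCRE engine: "]" is escaped inside the bracket expression,
# which requires perl = TRUE
pcre_result <- gsub('\\[cat[^\\]]*\\]', '[cat]', x, perl = TRUE)
```

Both calls should give the same result for these inputs, so the choice mostly comes down to which engine the rest of your patterns rely on.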

See the R demo:

x <- c("[cat 00.04;09]", "[cat 00.04;10]")
gsub('\\[cat[^]]*\\]', '[cat]', x)
## => [1] "[cat]" "[cat]"

If cat can be any word, use

gsub('\\[(\\w+)[^]]*\\]', '[\\1]', x)

where (\\w+) is a capturing group with ID=1 that matches 1 or more word chars, and \\1 in the replacement pattern is a replacement backreference that stands for the group value.
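Applied to a data frame column, as in the question, this could look like the sketch below. The data frame here is made up for illustration (the "[dog ...]" row is an assumption, not from the original post), and the result is assigned back to the column because gsub returns a new vector rather than modifying its input.

```r
# Hypothetical data frame mirroring the question's df$strings column
df <- data.frame(strings = c("[cat 00.04;09]", "[dog 11.02;03]"),
                 stringsAsFactors = FALSE)

# Keep the first word after "[" (group 1) and drop everything up to "]"
df$strings <- gsub('\\[(\\w+)[^]]*\\]', '[\\1]', df$strings)

df$strings
```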

Upvotes: 4
