takra

Reputation: 467

Perl not matching regex?

I'm trying to remove all the comments from a bunch of SGF files, and have come up with the following Perl command:

perl -pi -e 's/P?C\[(?:[^\]\\]++|\\.)*+\]//gm' *.sgf

I'm trying to match and remove a C or PC followed by a left bracket, then any characters other than a right bracket (a right bracket inside the comment has to be escaped with a \), and then the closing right bracket.

I'm trying to match the following examples:

C[HelloBot9 [-\]: GTP Engine for HelloBot9 (white): HelloBot version 0.6.26.08]

PC[IA [-\]: GTP Engine for IA (black): GNU Go version 3.7.11
]

C[person [-\]: \\\]]

C[AyaMC [3k\]: GTP Engine for AyaMC (black): Aya version 6.61 : If you pass, AyaMC 
will pass. When AyaMC does not, please remove all dead stones.]

And some examples that shouldn't be matched:

XYZ[Other stuff \]]

C[stuff\]

PC[stuff\\\]

The regex works in several online regex testers (including a few that claim to be Perl regex testers), but for some reason it doesn't work on the command line. Any help is appreciated.

Upvotes: 3

Views: 546

Answers (1)

Wiktor Stribiżew

Reputation: 626806

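You need to run perl with the -0777 option. With -p alone, Perl reads and processes the input one line at a time, so a comment that spans several lines can never match the pattern. With -0777, the whole file is read as a single string, and matches that span lines can be found. So, using perl -0777pi -e instead of perl -pi -e will solve the issue.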

I would also suggest optimizing the pattern a bit by unrolling the alternation group, which makes the matching process "linear" (the regex engine no longer has to try an alternation at every position):

s/P?C\[[^]\\]*(?:\\.[^]\\]*+)*]//sg

Note that if C or PC should be matched only as a whole word (so that a longer identifier that merely ends in C is not affected), add \b before P.

Pattern details:

  • P?C\[ - either PC[ or C[ literal char sequence
  • [^]\\]* - zero or more chars other than \ and ]
  • (?:\\.[^]\\]*+)* - zero or more sequences of:
    • \\. - a literal \ and then any char (.)
    • [^]\\]*+ - 0+ chars other than ] and \ (matched possessively, no backtracking into the pattern)
  • ] - a literal ] symbol (note it does not have to be escaped outside the character class to denote a literal closing bracket)

Upvotes: 2
