user3147534

Reputation: 33

Perl regular expression explanation

I have a regular expression like this:

 s/<(?:[^>'"]|(['"]).?\1)*>//gs

and I don't know exactly what it means.

Upvotes: 0

Views: 90

Answers (2)

Lajos Veres

Reputation: 13725

This tool can explain the details: http://rick.measham.id.au/paste/explain.pl?regex=%3C%28%3F%3A[^%3E%27%22]|%28[%27%22]%29.%3F\1%29*%3E

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    [^>'"]                   any character except: '>', ''', '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      ['"]                     any character of: ''', '"'
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    .?                       any character except \n (optional
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    \1                       what was matched by capture \1
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  >                        '>'

So it tries to remove HTML tags, as ysth also mentions.
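For reference, the same breakdown can be written directly in Perl with the /x modifier, which allows whitespace and comments inside the pattern. This is only a sketch: the variable names and the sample string are made up for illustration, and the pattern itself is the one from the question, unchanged. One difference from the table above: the tool was given the pattern without the /s flag, and with /s the . also matches a newline.

```perl
use strict;
use warnings;

# The pattern from the question, spread out with /x so each piece can be commented.
my $tag = qr{
    <                   # a literal '<'
    (?:                 # non-capturing group, repeated by the * below
        [^>'"]          #   one character that is not '>' or a quote
      |                 #  or
        (['"])          #   an opening quote, captured as group 1
        .?              #   at most one character (any character, because of /s)
        \1              #   the same quote character again
    )*
    >                   # a literal '>'
}xs;

# Hypothetical sample input.
my $text = qq{<b>bold</b> and <a href="x">link</a>};

(my $stripped = $text) =~ s/$tag//g;
print "$stripped\n";    # prints: bold and link
```

The sample only works because the quoted attribute value is a single character long.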

Upvotes: 0

ysth

Reputation: 98388

The regex looks intended to remove HTML tags from input.

It matches text beginning with < and ending with >, containing any mix of characters that are neither > nor quotes and quoted strings (which may contain >). But it appears to have an error:

The .? says that a quoted string may contain only 0 or 1 characters; it was probably intended to be .*? (0 or more characters, matched lazily). And to prevent backtracking from doing things like making the . match a quote in some odd cases, the (?: ... ) grouping should be changed to an atomic, non-backtracking group: (?> ... ), that is, > instead of :.
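A small script makes the difference visible. This is a minimal sketch, not a recommended way to strip HTML: strip_tags and the variable names are hypothetical, the sample input is made up, and the two modified patterns are simply the changes described above applied to the original.

```perl
use strict;
use warnings;

my $original = qr/<(?:[^>'"]|(['"]).?\1)*>/s;    # pattern from the question
my $lazy     = qr/<(?:[^>'"]|(['"]).*?\1)*>/s;   # .? replaced by .*?
my $atomic   = qr/<(?>[^>'"]|(['"]).*?\1)*>/s;   # additionally (?: replaced by (?>

# Hypothetical helper: remove every match of $re from a copy of $text.
sub strip_tags {
    my ($re, $text) = @_;
    $text =~ s/$re//g;
    return $text;
}

# Attribute value longer than one character, and containing a '>'.
my $html = qq{<a title="a > b">text</a>};

print strip_tags($original, $html), "\n";   # prints: <a title="a > b">text
print strip_tags($lazy,     $html), "\n";   # prints: text
print strip_tags($atomic,   $html), "\n";   # prints: text
```

With the original pattern the opening tag is left in place, because the quoted value is longer than one character and so the quoted-string branch can never match it. On this well-formed input the lazy and atomic variants behave the same; they can only differ on malformed input such as unbalanced quotes.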

Upvotes: 1
