Pirvu Paul Daniel

Reputation: 71

Match double hyphens in comments of malformed XML

I need to parse XML files that violate the "no double hyphens in comments" rule of the XML standard, which makes MSXML complain. I am looking for a way to delete the offending hyphens.

I am using StringRegExpReplace(). I tried the following regular expressions:

<!--(.*)--> : correctly gets comments
<!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)

Given the right pattern, I would call:

StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing

How can I match the extra hyphens inside an XML comment while leaving the rest of the text alone?

Upvotes: 5

Views: 618

Answers (2)

Casimir et Hippolyte

Reputation: 89547

You can use this pattern:

(?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)()))

details:

(?| 
    \G(?!\A) # contiguous with the previous match (inside a comment)

    (?|
        -{2,}+([^->][^-]*) # duplicate hyphens, not part of the closing sequence
      |
         (-[^-]+)          # preserve isolated hyphens 
      |
         -+ (?=-->)        # hyphens before closing sequence, break contiguity
      |
         -->[^<]*          # closing sequence, go to next <
         (*SKIP)(*FAIL)    # break contiguity
    )
  |
    [^<]*<+ # reach the next < (outside comment)
    (?> [^<]+ <+ )*?       # next < until !-- or the end of the string 
    (?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
    (?|
        -*+ ([^->][^-]*)   # possible hyphens not followed by >
      |
        -+ (?=-->)         # hyphens before closing sequence, break contiguity
      |
        -?+ ([^-]+)        # one hyphen followed by >
      |
        -->[^<]*           # closing sequence, go to next <
        (*SKIP)(*FAIL) ()  # break contiguity (note: "()" avoids a mysterious bug
    )                      # in regex101, you can remove it)
)

With this replacement: \1

online demo

The \G feature ensures that matches are consecutive. Two ways are used to break the contiguity:

  • a lookahead (?=-->)
  • the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent the characters already matched from being retried.

So when contiguity is broken, or at the beginning of the string, the first main branch fails (because of \G(?!\A)) and the second branch is used.

\K drops everything matched to its left from the match result.

(*ACCEPT) makes the pattern succeed unconditionally.

This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capturing groups share the same number (in other words, there is only one group: group 1).

Note: even though this pattern is long, it needs few steps to obtain a match. The impact of non-greedy quantifiers is reduced as much as possible, and the alternatives are ordered to be as efficient as possible. One of the goals is to reduce the total number of matches needed to process a string.
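The pattern above depends on PCRE features (\G, \K, branch reset, backtracking verbs) that Python's standard re module does not support. For readers who want to check the intended result outside AutoIt, here is a rough Python sketch of a different and simpler two-pass technique, not the pattern above: find each comment, then clean only its body. The names COMMENT and strip_extra_hyphens are hypothetical, and the sketch assumes comments are delimited by the first --> after each <!--.

```python
import re

# Hypothetical helper names; this is a two-pass alternative, not the PCRE pattern above.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)  # body = everything up to the first -->

def strip_extra_hyphens(xml_text):
    """Remove runs of 2+ hyphens, and hyphens right before -->, inside comments only."""
    def fix(match):
        body = match.group(1)
        body = re.sub(r"-{2,}", "", body)  # drop runs of two or more hyphens
        body = body.rstrip("-")            # a trailing hyphen would produce "--->"
        return "<!--" + body + "-->"
    return COMMENT.sub(fix, xml_text)

sample = "<r><!-- a -- b - c ---><x>1--2</x><!-- d --></r>"
cleaned = strip_extra_hyphens(sample)
print(cleaned)
```

Isolated hyphens inside the comment and any hyphens outside comments are left untouched, because the second substitution only ever sees the captured comment body.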

Upvotes: 4

Tim Pietzcker

Reputation: 336098

(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)

matches -- (or ---- etc.) only between <!-- and -->. You need to set the /s option (the inline (?s) flag in the AutoIt call below) so that the dot also matches newlines.

Explanation:

(?<!<!)   # Assert that we're not right at the start of a comment
--+       # Match two or more dashes
(?!-?>)   # Make sure we're not at the end of the comment
(?=       # and only if the following can be matched further onwards:
 (?:      # Match the following group
  (?!-->) # which must not contain -->
  .       # but may contain any character
 )*       # any number of times
 -->      # as long as --> follows.
)         # End of lookahead assertion.

Test it live on regex101.com.

I suppose the correct AutoIt syntax would be

StringRegExpReplace($xml_string, "(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "")
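
As a quick way to try the same pattern outside AutoIt, here is a minimal Python sketch. It assumes Python's re module as a stand-in for AutoIt's PCRE engine (the pattern only uses lookarounds, which both support), with re.DOTALL in place of (?s). The names DOUBLE_HYPHENS and remove_double_hyphens are hypothetical.

```python
import re

# Same pattern as in the AutoIt call; re.DOTALL plays the role of (?s).
DOUBLE_HYPHENS = re.compile(r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", re.DOTALL)

def remove_double_hyphens(xml_text):
    """Delete runs of two or more hyphens that still have a closing --> ahead of them."""
    return DOUBLE_HYPHENS.sub("", xml_text)

sample = "<a><!-- foo -- bar ---- baz --><b>x--y</b></a>"
cleaned = remove_double_hyphens(sample)
print(cleaned)
```

In this sample the x--y after the last comment is left alone because no --> follows it. The lookahead only checks that a --> appears somewhere ahead, so a double hyphen in text that precedes a later comment would also match; it is worth testing against real input before relying on it.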

Upvotes: 3
