Reputation: 21

regex to find incomplete xml tags in c#

I'm trying to use a regular expression to find incomplete XML tags that have no attributes. So far, I've managed to come up with this regex: </?\s*([a-zA-Z0-9]?:\s+)?[a-zA-Z0-9]*(?!>) but it doesn't do the trick. In XML like this: <abc> </abc> <ab> </ab <s:ab

I want to match </ab and <s:ab (as they're both lacking ">" at the end). Is there a way to do this using regular expressions in c#?

Upvotes: 1

Answers (3)

FrankieTheKneeMan

Reputation: 6800

As people have said, this is probably a fruitless endeavor, as XML is not a regular language. However, part of your problem is your lookahead. You only ensure that the match is not immediately followed by a closing angle bracket, which means things like the <ab of <abc> will match even when you don't want them to: the engine simply gives back the c and the lookahead then succeeds. So you need to include the entire tag structure in your lookahead.

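To get a match for the exact data you gave, I could use this regular expression (the surrounding # characters are pattern delimiters, not part of the regex, so leave them out in C#):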

#</?([a-z]?:)?[a-z]*(?!/?([a-z]?:)?[a-z]*>)#

The key is to make sure that at no point can the regular expression engine backtrack (say, by dropping one character) to satisfy the lookahead. There are other ways to do this, such as possessive quantifiers, which refuse to give up what they have matched during normal backtracking, but the standard .NET engine doesn't support possessive matching. It does support atomic groups, written (?> ... ), which behave the same way but use a group instead of a quantifier. Below, I've wrapped the entire opening of the tag in an atomic group:

#(?></?([a-z]?:)?[a-z]*)(?!>)#
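
To make the difference concrete, here is a minimal C# sketch that runs both variants over the sample from the question. The class and member names (IncompleteTags, Find, Backtracking, Atomic) are made up for illustration, and the first pattern is a simplified form of the question's regex with the \s parts left out, so treat it as a demonstration rather than a finished solution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class IncompleteTags
{
    // Simplified form of the question's pattern: the lookahead can be
    // satisfied by backtracking one character into the tag name.
    public const string Backtracking = @"</?([a-z]?:)?[a-z]*(?!>)";

    // Same pattern with the tag opening wrapped in an atomic group,
    // so the engine cannot give characters back to pass the lookahead.
    public const string Atomic = @"(?></?([a-z]?:)?[a-z]*)(?!>)";

    // Hypothetical helper: returns every match of the pattern as a string.
    public static List<string> Find(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToList();

    static void Main()
    {
        const string xml = "<abc> </abc> <ab> </ab <s:ab";

        // prints: <ab | </ab | <a | </ab | <s:ab
        Console.WriteLine(string.Join(" | ", Find(xml, Backtracking)));

        // prints: </ab | <s:ab
        Console.WriteLine(string.Join(" | ", Find(xml, Atomic)));
    }
}
```

The first line of output shows the false positives produced by backtracking; the second shows only the two unterminated tags.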

You're free to substitute your own regular expression for how a tag ought to be formatted, but I must say that this regular expression is already pushing the limits of readable code, and messing about with legal XML tag names is going to push it further in that direction. Nevertheless, I hope this has helped shed some light on the error.

Upvotes: 0

Qtax

Reputation: 33908

You are pretty close. Your major problem is that the pattern backtracks when the negative lookahead fails. You can avoid that by putting the part before the lookahead in a non-backtracking atomic group: (?>no backtracking in here).

For example:

(?xi)                   # turn on eXtended (ignore spaces/comments) and case-Insensitive mode
(?>                     # don't backtrack
  < /?                  # tag start (no space allowed after it)
  [a-z0-9]+             # tag name, or namespace prefix
  (?: : [a-z0-9]+ )?    # optional ":" followed by the local name
  \s*                   # optional spaces
)
(?! > )                 # no ending

Note that this will also match <foo in <foo bar>, because the pattern does not account for attributes.
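
As a rough sketch of how the commented pattern above could be used from C#: a verbatim string keeps the layout intact, and the inline (?xi) switches on the same behavior as RegexOptions.IgnorePatternWhitespace and RegexOptions.IgnoreCase. The names LooseTagScanner and FindUnclosed are invented for this example. Because \s* sits inside the atomic group, matches can carry trailing whitespace, so the sketch trims them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class LooseTagScanner
{
    // The pattern from the answer, kept in extended (commented) form.
    const string Pattern = @"(?xi)   # extended + case-insensitive mode
        (?>                          # atomic group: no backtracking
          < /?                       # tag start
          [a-z0-9]+                  # tag name, or namespace prefix
          (?: : [a-z0-9]+ )?         # optional ':' and local name
          \s*                        # optional spaces
        )
        (?! > )                      # not followed by a closing bracket";

    // Hypothetical helper: returns the unclosed tag openings, trimmed.
    public static List<string> FindUnclosed(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Select(m => m.Value.Trim())
             .ToList();

    static void Main()
    {
        // prints: </ab, <s:ab, <foo
        Console.WriteLine(string.Join(", ",
            FindUnclosed("<abc> </abc> <ab> </ab <s:ab <foo bar>")));
    }
}
```

The <foo in the output is the attribute case mentioned above; it would need extra handling if tags with attributes can appear in the input.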

Upvotes: 1

iseeall

Reputation: 3431

If you are just trying to find errors in a single XML file, try opening it in the Google Chrome web browser - it will show the line where the error is.

But if you have lots of files that you have to process in code, then you'd need something more powerful than regexes.
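
One option for the "process in code" case, sketched here as an assumption rather than a recommendation from the original answer, is to let .NET's own XML parser do the checking: XmlReader throws an XmlException on the first well-formedness error, and the exception carries LineNumber and LinePosition. The class and method names (XmlChecker, FindFirstError) are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlChecker
{
    // Reads the whole document; returns the first well-formedness error,
    // or null if the parser reaches the end without complaint.
    public static XmlException FindFirstError(string xml)
    {
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml)))
            {
                while (reader.Read()) { /* just consume every node */ }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex;
        }
    }

    static void Main()
    {
        // Illustrative input: the closing </ab tag is missing its '>'.
        string broken = "<root>\n  <abc></abc>\n  <ab></ab\n</root>";

        XmlException error = FindFirstError(broken);
        if (error != null)
        {
            Console.WriteLine("Line " + error.LineNumber +
                              ", position " + error.LinePosition +
                              ": " + error.Message);
        }
    }
}
```

This only reports the first error per document, since the parser stops there, but unlike the regex approaches it also handles attributes, comments and CDATA sections.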

Upvotes: 0
