Regex Interpretation

Question

I was trying to filter invalid characters out of XML. Although I got that working, along the way I wrote a regex that behaves counterintuitively to me.

Please consider the following .NET regex evaluation:

System.Text.RegularExpressions.Regex.Match("Test", @"[\x01-\x08\x0B-\x0C\x0E-\x1F\xD800-\xDFFF\xFFFE-\xFFFF]+").ToString()

My understanding is that this pattern matches all invalid XML characters, based on this page: http://www.w3.org/TR/REC-xml/#NT-Char

These are valid characters:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In my understanding, the pattern above is the set difference, i.e. the remaining Unicode characters, which are invalid in XML. However, running the statement above still produces this result:

"Test"

(i.e. the entire input string). I cannot understand why. In particular, this portion of the regex causes the match: \xD800-\xDFFF

And to me it appears that this range is excluded from the valid characters by these two groups: [#x20-#xD7FF] | [#xE000-#xFFFD]

So I am at a total loss as to why the statement above produces a match. Can somebody please help me decipher it?

Kevin Brydon · Accepted Answer

Try using \u instead of \x.

System.Text.RegularExpressions.Regex.Match("Test", @"[\x01-\x08\x0B-\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE-\uFFFF]+").ToString();
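To see the difference, here is a minimal sketch that runs both patterns against the same input. The class name XmlRegexDemo and the constants WithX and WithU are made up for this illustration; the patterns themselves are copied from the question and from the line above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlRegexDemo
{
    // Pattern from the question, written with \x escapes.
    public const string WithX =
        @"[\x01-\x08\x0B-\x0C\x0E-\x1F\xD800-\xDFFF\xFFFE-\xFFFF]+";

    // Same pattern with \u escapes for the four-digit code points.
    public const string WithU =
        @"[\x01-\x08\x0B-\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE-\uFFFF]+";

    public static void Main()
    {
        // The \x version matches the whole input, as reported in the question.
        Console.WriteLine(Regex.Match("Test", WithX).Value);       // Test

        // The \u version finds nothing to match in plain ASCII text.
        Console.WriteLine(Regex.Match("Test", WithU).Success);     // False

        // It does match a control character, so Replace can strip it.
        Console.WriteLine(Regex.Replace("Te\u0001st", WithU, "")); // Test
    }
}
```

One caveat if you use the \u version for stripping: .NET strings are UTF-16, so characters in [#x10000-#x10FFFF] are stored as surrogate pairs, and the \uD800-\uDFFF range will match those code units as well.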

As I understand it, your current regex matches the string "Test" because \x consumes only the next two hex digits, so the character class is actually made up of the following items:

\x01-\x08
\x0B-\x0C
\x0E-\x1F
\xD8
0
0-\xDF
F
F
\xFF
F
E-\xFF
F
F

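The range 0-\xDF is the one that matches such a wide set of characters: it runs from U+0030 to U+00DF, which covers the digits and all ASCII letters, so every character of "Test" falls inside it.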
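A quick way to check this reading is to test the pieces separately. This is only a sketch; the class HexEscapeDemo and the helper FullMatch are hypothetical names introduced for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexEscapeDemo
{
    // True when the pattern matches the input from start to end.
    public static bool FullMatch(string input, string pattern) =>
        Regex.IsMatch(input, "^(?:" + pattern + ")$");

    public static void Main()
    {
        // \xD800 is read as \xD8 (U+00D8) followed by the literal text "00".
        Console.WriteLine(FullMatch("\u00D8" + "00", @"\xD800"));           // True

        // It is not the single code unit U+D800.
        Console.WriteLine(FullMatch(((char)0xD800).ToString(), @"\xD800")); // False

        // The accidental range 0-\xDF covers every character of "Test".
        Console.WriteLine(FullMatch("Test", @"[0-\xDF]+"));                 // True
    }
}
```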
