Reputation: 893
I was trying to filter invalid characters out of XML. I have managed to do that, but along the way I wrote a regex whose behavior is counter-intuitive to me.
Please consider the following .NET regex evaluation:
System.Text.RegularExpressions.Regex.Match("Test", @"[\x01-\x08\x0B-\x0C\x0E-\x1F\xD800-\xDFFF\xFFFE-\xFFFF]+").ToString()
My understanding is that this regex pattern matches all invalid XML characters. According to http://www.w3.org/TR/REC-xml/#NT-Char, these are the valid characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
As I understand it, the character class above is the set difference, i.e. the remaining characters, which are the invalid XML characters. However, running the statement above still produces this result:
"Test"
(i.e. the entire input string). I cannot understand why. In particular, this portion of the regex causes the match: \xD800-\xDFFF
To me it appears that this range is exactly what these two groups of valid characters leave out: [#x20-#xD7FF] | [#xE000-#xFFFD]
So I am at a total loss as to why the statement above produces a match. Can somebody please help me decipher it?
Upvotes: 1
Views: 556
Reputation: 13102
Try using \u instead of \x.
System.Text.RegularExpressions.Regex.Match("Test", @"[\x01-\x08\x0B-\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE-\uFFFF]+").ToString();
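A quick way to see the difference is to run both patterns against the same input. The following is only a minimal sketch; the class name PatternComparison and the helper FirstMatch are made up for illustration, and the two patterns are copied from the question and from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PatternComparison
{
    // Pattern from the question, using \x for the four-digit code points.
    public const string WithX =
        @"[\x01-\x08\x0B-\x0C\x0E-\x1F\xD800-\xDFFF\xFFFE-\xFFFF]+";

    // Pattern from this answer, using \u for the four-digit code points.
    public const string WithU =
        @"[\x01-\x08\x0B-\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE-\uFFFF]+";

    // Returns the text of the first match, or "" when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine("[" + FirstMatch("Test", WithX) + "]"); // [Test]
        Console.WriteLine("[" + FirstMatch("Test", WithU) + "]"); // []
    }
}
```

With \u the character class no longer contains any ordinary letters, so "Test" produces an empty (failed) match.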
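The way I understand it, your current regex matches the string "Test" because \x consumes only two hex digits and the remaining digits are treated as literal characters. The character class is therefore essentially made up of the following items: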
\x01-\x08
\x0B-\x0C
\x0E-\x1F
\xD8
0
0-\xDF
F
F
\xFF
F
E-\xFF
F
F
The range 0-\xDF is most likely the culprit: it runs from the digit 0 (U+0030) through U+00DF, which covers every ASCII digit and letter, including all the characters in "Test".
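The breakdown above can be checked piece by piece. This is a rough sketch rather than a definitive test; HexEscapeDemo and IsFullMatch are hypothetical names introduced here, and the sketch relies on \x taking exactly two hex digits in .NET regular expressions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class HexEscapeDemo
{
    // True when the pattern matches the entire input string.
    public static bool IsFullMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, @"\A(?:" + pattern + @")\z");
    }

    public static void Main()
    {
        string oSlashThenZeros = "\u00D8" + "00";          // U+00D8 followed by "00"
        string loneSurrogate = ((char)0xD800).ToString();  // the single code unit U+D800

        // \xD800 is read as \xD8 followed by the literal characters "00".
        Console.WriteLine(IsFullMatch(oSlashThenZeros, @"\xD800")); // True
        Console.WriteLine(IsFullMatch(loneSurrogate, @"\xD800"));   // False

        // \uD800 is read as the single code unit U+D800.
        Console.WriteLine(IsFullMatch(loneSurrogate, @"\uD800"));   // True

        // The accidental range 0-\xDF alone is enough to match "Test".
        Console.WriteLine(IsFullMatch("Test", @"[0-\xDF]+"));       // True
    }
}
```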
Upvotes: 3