wtfowler

Reputation: 23

Find unicode numbers using regex (.NET)

I am attempting to find numbers from any numeral system in strings. I found that the .NET regular expression language supports matching Unicode character categories, so I figured I could use that to capture my numbers (at the moment I can reasonably expect the strings I am reading to come from a UTF-8 encoded file).

The problem is that I can't seem to correctly identify all numerals. Here is a fiddle where I have attempted to identify a few numerals, but some of them are not recognized as Unicode numbers (the same results come from running a console app with the same code locally on .NET 4.6.2). I took each of the test numerals in the fiddle from one of the Unicode number category lists here.

Given this fiddle, it seems that the .NET regex language does not recognize every number in the Unicode standard as a number. Is this correct? It gets most cases right, so I can probably still use it for what I am doing, but I'd like to know whether I am doing something wrong, or whether Microsoft has a statement relevant to this problem that I can't find.

EDIT: Per commenter request, here is the code from the fiddle:

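using System;
using System.Text.RegularExpressions;

// The last four numerals are outside the Basic Multilingual Plane.
string[] numbers = new string[] { "1", "¼", "㆓", "⑱", "២", "꘩", "꤁", "〺", "፷", "𐌢", "𑁜", "𑇩", "𒐘" };
string pattern = @"\p{N}"; // any character in a Unicode Number category

foreach (string num in numbers) {
    Console.WriteLine("{0}, {1}", num, Regex.IsMatch(num, pattern));
}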

And the output:

1, True
¼, True
㆓, True
⑱, True
២, True
꘩, True
꤁, True
〺, True
፷, True
𐌢, False
𑁜, False
𑇩, False
𒐘, False

Upvotes: 2

Views: 299

Answers (1)

Thomas English

Reputation: 231

This happens because strings in .NET are UTF-16 encoded.

Only characters in the Basic Multilingual Plane can be represented by a single 16-bit number equal to their code point. Characters in the supplementary planes (U+10000 to U+10FFFF) have to be represented using surrogate pairs, that is, each one is encoded as a pair of 16-bit numbers.
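As a small illustration of the encoding described above, here is a minimal sketch. The class name `SurrogateDemo`, the helper `CodeUnitCount`, and the choice of U+10322 (the `𐌢` from the question) are only for illustration.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: the number of UTF-16 code units (char values) in a string.
    public static int CodeUnitCount(string s)
    {
        return s.Length;
    }

    public static void Main()
    {
        // U+10322 lies in a supplementary plane, so it needs a surrogate pair.
        string oldItalicTen = "\U00010322";

        Console.WriteLine(CodeUnitCount(oldItalicTen));                         // 2
        Console.WriteLine(char.IsHighSurrogate(oldItalicTen[0]));               // True
        Console.WriteLine(char.IsLowSurrogate(oldItalicTen[1]));                // True
        // Recombine the pair into the original code point.
        Console.WriteLine(char.ConvertToUtf32(oldItalicTen, 0).ToString("X"));  // 10322
    }
}
```

A single visible character can therefore have a `Length` of 2, which is what any per-`char` API ends up seeing.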

For this reason, the .NET regex engine, which looks at one UTF-16 code unit at a time, categorises each half of a supplementary-plane character as a "Surrogate", rather than one of the other categories such as "LetterNumber", "OtherNumber", etc. This prevents these characters from matching the Number categories (\p{N}) in the regex.
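One possible way around this, offered as a sketch and not as the only option, is to skip the regex for this check and inspect whole code points instead. The class `NumberFinder` and the method `ContainsUnicodeNumber` are hypothetical names. The sketch relies on `CharUnicodeInfo.GetUnicodeCategory(string, int)`, which reports the category of the full surrogate pair when the index points at the first half of a valid pair.

```csharp
using System;
using System.Globalization;

public static class NumberFinder
{
    // Hypothetical helper: true if any code point in s belongs to a
    // Unicode Number category (Nd, Nl or No), including code points
    // stored as surrogate pairs.
    public static bool ContainsUnicodeNumber(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // For the first half of a valid surrogate pair this returns
            // the category of the combined code point.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, i);

            if (category == UnicodeCategory.DecimalDigitNumber ||
                category == UnicodeCategory.LetterNumber ||
                category == UnicodeCategory.OtherNumber)
            {
                return true;
            }

            // Step over the low surrogate so it is not examined on its own.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(ContainsUnicodeNumber("abc"));
        Console.WriteLine(ContainsUnicodeNumber("a1"));
        Console.WriteLine(ContainsUnicodeNumber("\U00010322")); // the 𐌢 from the question
    }
}
```

The result depends on the Unicode data shipped with the runtime, so a numeral added to Unicode after that data was compiled may still be reported as unassigned.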

You can check which category .NET assigns to a particular char by calling Char.GetUnicodeCategory() on it.
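A minimal sketch of that check follows. The names `CategoryDemo`, `PerCodeUnit` and `PerCodePoint` are made up for illustration. It compares the category of a single `char` with the category that `CharUnicodeInfo.GetUnicodeCategory(string, int)` reports for the whole code point.

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    // Category of one UTF-16 code unit, which is the per-char view.
    public static UnicodeCategory PerCodeUnit(string s, int index)
    {
        return char.GetUnicodeCategory(s[index]);
    }

    // Category of the code point starting at index; a valid surrogate
    // pair is treated as a single character here.
    public static UnicodeCategory PerCodePoint(string s, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(s, index);
    }

    public static void Main()
    {
        string bmpDigit = "1";
        string oldItalicTen = "\U00010322"; // the 𐌢 from the question

        Console.WriteLine(PerCodeUnit(bmpDigit, 0));      // DecimalDigitNumber
        Console.WriteLine(PerCodeUnit(oldItalicTen, 0));  // Surrogate
        Console.WriteLine(PerCodePoint(oldItalicTen, 0)); // OtherNumber
    }
}
```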

Upvotes: 3
