Marcus King

Reputation: 1677

Regular expression of unicode characters on string

I'm working in C# doing some OCR work and have extracted the text I need to work with. Now I need to parse a line using Regular Expressions.

string checkNum;
string routingNum;
string accountNum;
Regex regEx = new Regex(@"\u9288\d+\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
    checkNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\u9286\d{9}\u9286");
match = regEx.Match(numbers);
if(match.Success)
    routingNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\d{10}\u9288");
match = regEx.Match(numbers);
if (match.Success)
    accountNum = match.Value.Remove(match.Value.Length - 1, 1);

The problem is that the string contains the necessary Unicode characters when I do a .ToCharArray() and inspect the contents of the string, but it never seems to recognize the Unicode characters when I parse the string looking for them. I thought strings in C# were Unicode by default.

Upvotes: 4

Views: 13354

Answers (3)

Marcus King

Reputation: 1677

I figured it out. I was using the decimal values instead of the hex codes. In other words, instead of \u9288 and \u9286 I should have been using \u2448 and \u2446. See http://www.ssec.wisc.edu/~tomw/java/unicode.html#x2440

Thanks guys for leading me in the right direction.
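
For anyone hitting the same thing, here is a minimal sketch of how the decimal/hex mix-up can be spotted. Casting a char to int gives the decimal value (9288), while the \u escape expects exactly four hex digits (2448). The class name, the ToRegexEscape helper, and the sample string are made up for illustration and are not from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MicrEscapeDemo
{
    // Hypothetical helper: formats a UTF-16 code unit value as a \uXXXX escape.
    // \u always takes exactly four hexadecimal digits, so only 0..0xFFFF is accepted.
    public static string ToRegexEscape(int value)
    {
        if (value < 0 || value > 0xFFFF)
            throw new ArgumentOutOfRangeException(nameof(value));
        return @"\u" + value.ToString("X4");
    }

    public static void Main()
    {
        // Assumed sample: digits wrapped in the U+2448 delimiter.
        string sample = "\u2448001234\u2448";

        // Show each character with both its decimal and its hex value.
        foreach (char c in sample)
            Console.WriteLine($"'{c}' decimal {(int)c} hex U+{(int)c:X4}");

        Console.WriteLine(ToRegexEscape(9288));                       // \u2448
        Console.WriteLine(Regex.IsMatch(sample, @"\u2448\d+\u2448")); // True
        Console.WriteLine(Regex.IsMatch(sample, @"\u9288\d+\u9288")); // False
    }
}
```

The last pattern fails because \u9288 is read as hex 9288, which is a different character from the one in the string.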

Upvotes: 4

bruno conde

Reputation: 48265

This line:

match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

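throws an ArgumentOutOfRangeException. The first Remove shortens the string by one character, so the index match.Value.Length - 1, which is computed from the original match.Value, is past the end of the shortened string.

I suggest you use groups to extract the value. For example: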

Regex regEx = new Regex(@"\u9288(\d+)\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
    checkNum = match.Groups[1].Value;

With that, I can extract the values correctly.
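
A rough sketch of the same idea applied to all three fields might look like the following. It assumes the delimiters are U+2448 and U+2446 (the hex values from the accepted fix), and the sample line, the class name, and the TryExtract helper are made up for illustration. The real OCR output may be laid out differently.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MicrLineParser
{
    // Hypothetical helper: copies capture group 1 into value and returns true,
    // or sets value to an empty string and returns false when nothing matches.
    public static bool TryExtract(string input, string pattern, out string value)
    {
        Match match = Regex.Match(input, pattern);
        value = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        // Assumed sample line: check number, routing number, account number.
        string numbers = "\u2448001234\u2448 \u2446123456789\u2446 1234567890\u2448";

        // The parentheses capture only the digits, so no Remove calls are needed.
        if (TryExtract(numbers, @"\u2448(\d+)\u2448", out string checkNum))
            Console.WriteLine("check:   " + checkNum);
        if (TryExtract(numbers, @"\u2446(\d{9})\u2446", out string routingNum))
            Console.WriteLine("routing: " + routingNum);
        if (TryExtract(numbers, @"(\d{10})\u2448", out string accountNum))
            Console.WriteLine("account: " + accountNum);
    }
}
```

Because the delimiters sit outside the capture group, the extracted value never includes them and there are no index calculations to get wrong.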

Upvotes: 1

Doug

Reputation: 5338

Strings in .NET are UTF-16 encoded.

Additionally, Regex engines don't match against Unicode characters, but against Unicode code points. See this post.
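
As a small illustration of the UTF-16 point, the sketch below counts how many char values a .NET string uses for a given code point. The class and helper names are made up. A character such as U+2448 fits in a single char, so a \u2448 escape in a pattern lines up with exactly one element of the string, whereas code points above U+FFFF are stored as a surrogate pair.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Utf16Demo
{
    // Hypothetical helper: number of UTF-16 code units (char values)
    // a .NET string needs to store the given code point.
    public static int CodeUnitCount(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    public static void Main()
    {
        Console.WriteLine(CodeUnitCount(0x2448));  // 1
        Console.WriteLine(CodeUnitCount(0x1F600)); // 2

        // A single \uXXXX escape matches the single char that stores U+2448.
        string s = char.ConvertFromUtf32(0x2448);
        Console.WriteLine(Regex.IsMatch(s, @"^\u2448$")); // True
    }
}
```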

Upvotes: 0
