Reputation: 1677
I'm working in C# doing some OCR work and have extracted the text I need to work with. Now I need to parse a line using Regular Expressions.
string checkNum;
string routingNum;
string accountNum;
Regex regEx = new Regex(@"\u9288\d+\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
checkNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\u9286\d{9}\u9286");
match = regEx.Match(numbers);
if(match.Success)
routingNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\d{10}\u9288");
match = regEx.Match(numbers);
if (match.Success)
accountNum = match.Value.Remove(match.Value.Length - 1, 1);
The problem is that the string contains the necessary Unicode characters when I do a .ToCharArray() and inspect the contents of the string, but the regex never seems to recognize the Unicode characters when I parse the string looking for them. I thought strings in C# were Unicode by default.
Upvotes: 4
Views: 13354
Reputation: 1677
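I figured it out. I was using the decimal values of the code points instead of the hexadecimal values that the \u escape expects.
In other words, instead of using \u9288 and \u9286
I should have been using \u2448 and \u2446 (decimal 9288 and 9286 are hex 2448 and 2446).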
http://www.ssec.wisc.edu/~tomw/java/unicode.html#x2440
Thanks guys for leading me in the right direction.
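To make the mix-up concrete, here is a minimal sketch. The class name and the sample string are made up for illustration. The \u escape takes four hexadecimal digits, both in C# string literals and in .NET regex patterns, so the character whose decimal value is 9288 has to be written as \u2448.

```csharp
using System;
using System.Text.RegularExpressions;

class EscapeDemo
{
    static void Main()
    {
        // Hypothetical sample: one OCR delimiter character followed by digits.
        string sample = "\u2448" + "1234";

        // The same code point, printed as decimal and as hex.
        Console.WriteLine(((int)sample[0]).ToString());     // 9288
        Console.WriteLine(((int)sample[0]).ToString("X4")); // 2448

        // \u9288 names a different character (code point 0x9288), so it does not match.
        Console.WriteLine(Regex.IsMatch(sample, @"\u9288")); // False

        // \u2448 is the character that is actually in the string.
        Console.WriteLine(Regex.IsMatch(sample, @"\u2448")); // True
    }
}
```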
Upvotes: 4
Reputation: 48265
This line:
match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
causes an ArgumentOutOfRangeException, because the string returned by the first Remove is one character shorter than match.Value, so removing one character at index match.Value.Length - 1 runs past its end.
I suggest you use groups to extract the value. For example:
Regex regEx = new Regex(@"\u9288(\d+)\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
checkNum = match.Groups[1].Value;
With that, I can extract the values correctly.
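Putting the two fixes together (capture groups plus the hexadecimal escapes \u2448 and \u2446), a rough sketch for all three fields could look like the following. The Extract helper, the class name, and the sample line are hypothetical. The patterns are the ones from the question with only the escapes changed, so they assume the same field layout.

```csharp
using System;
using System.Text.RegularExpressions;

static class MicrSketch
{
    // Returns the first capture group of the first match, or null when nothing matches.
    public static string Extract(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        // Made-up sample line; U+2448 and U+2446 are the OCR delimiter characters.
        string numbers = "\u2448001234\u2448 \u2446123456789\u2446 0123456789\u2448";

        // The delimiters stay outside the parentheses, so no Remove calls are needed.
        string checkNum = Extract(numbers, @"\u2448(\d+)\u2448");
        string routingNum = Extract(numbers, @"\u2446(\d{9})\u2446");
        string accountNum = Extract(numbers, @"(\d{10})\u2448");

        Console.WriteLine(checkNum);   // 001234
        Console.WriteLine(routingNum); // 123456789
        Console.WriteLine(accountNum); // 0123456789
    }
}
```

Because Extract returns null when a pattern does not match, the caller can tell a missing field apart from an empty one.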
Upvotes: 1
Reputation: 5338
Strings in .NET are UTF-16 encoded.
Additionally, Regex engines don't match against Unicode characters, but against Unicode code points. See this post.
Upvotes: 0