Reputation: 1472
I am using a regex to parse data from an OCR'd document, and I am struggling to match the cases where a thousands-separator comma has been misread as a dot, and also where the decimal dot has been misread as a comma!
So if the true value is 1234567.89, printed as 1,234,567.89, it may be misread as:
1.234,567.89
1,234.567.89
1,234,567,89
etc
I could probably sort this in C# but I'm sure that a regex could do it. Any regex-wizards out there that can help?
UPDATE:
I realise this is a pretty dumb question, as the regex to catch all of these is pretty straightforward; the real issue is how I choose to interpret the match, which will be done in C#. Thanks - sorry to waste your time on this!
I will accept Dmitry's answer, as it is close to what I was looking for. Thank you.
Upvotes: 2
Views: 538
Reputation: 186748
Please note that there is an ambiguity, since:
123,456 // thousand separator
123.456 // decimal separator
are both possible (123456 and 123.456). However, we can detect some cases:
123.456.789
123.456,789
123,45
So we can set up a rule: a separator can be the decimal one if it is the last one and is not followed by exactly three digits (see the ambiguity above); all the other separators should be treated as thousands separators:
1?234?567?89
 ^   ^   ^
 |   |   the last one, followed by two digits (not three), thus decimal
 |   not the last one, thus thousand
 not the last one, thus thousand
Now let's implement a routine:
// Requires: using System; using System.Linq;
private static String ClearUp(String value) {
  String[] chunks = value.Split(',', '.');

  // No separators: nothing to clean up
  if (chunks.Length <= 1)
    return value;

  // Let's look at the last chunk:
  // if it is not exactly three characters long, the last separator
  // is definitely a decimal separator (e.g. "123,45")
  if (chunks[chunks.Length - 1].Length != 3)
    return String.Concat(chunks.Take(chunks.Length - 1)) +
           "." +
           chunks[chunks.Length - 1];

  // Three characters after the last separator: may be decimal or thousand,
  // so trust the character itself (',' is thousand, '.' is decimal)
  if (value[value.Length - 4] == ',')
    return String.Concat(chunks);
  else
    return String.Concat(chunks.Take(chunks.Length - 1)) +
           "." +
           chunks[chunks.Length - 1];
}
Now let's try some tests:
String[] data = new String[] {
  // your tests
  "1.234,567.89",
  "1,234.567.89",
  "1,234,567,89",
  // my tests
  "123,456", // "," is treated as a thousand separator and removed
  "123.456", // "." should be left intact, i.e. decimal separator
};

String report = String.Join(Environment.NewLine, data
  .Select(item => String.Format("{0} -> {1}", item, ClearUp(item))));

Console.Write(report);
The outcome is:
1.234,567.89 -> 1234567.89
1,234.567.89 -> 1234567.89
1,234,567,89 -> 1234567.89
123,456 -> 123456
123.456 -> 123.456
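Once a string has been cleaned up like this, it still has to be turned into a number. The following is only a minimal sketch of that step, assuming the cleaned string contains digits and at most one '.'; the ParseDemo class and ParseCleaned method are hypothetical names, not part of the routine above. It uses the invariant culture so that the result does not depend on the machine's regional settings.

```csharp
using System;
using System.Globalization;

public static class ParseDemo
{
    // Hypothetical helper: parses an already cleaned string such as "1234567.89".
    // Assumption: '.' is the only separator left in the input.
    public static decimal ParseCleaned(string cleaned)
    {
        return decimal.Parse(
            cleaned,
            NumberStyles.AllowDecimalPoint,   // digits plus a decimal point, no thousands separators
            CultureInfo.InvariantCulture);    // '.' is the decimal separator in this culture
    }
}
```

For example, ParseDemo.ParseCleaned("1234567.89") yields the decimal value 1234567.89, and an input that still contains a comma would throw a FormatException rather than being silently misread.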
Upvotes: 3
Reputation: 3399
Responding to the update/comments: you do not need a regex to do this. Instead, if you can isolate the number string from the surrounding spaces, you can pull it into a string array using Split(',','.'). Based on the logic you outlined above, you could then use the last element of the array as the fractional part, and concatenate the preceding elements together for the whole part. (Actual code left as an exercise...) This will even work if the ambiguous dot-or-comma is the last character in the string: the last element in the split array will be empty.
Caveat: this will only work if there is always a decimal point; otherwise, you would not be able to differentiate logically between a thousands-place comma and a decimal point followed by thousandths.
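The split-based idea could be sketched roughly as follows. This is one possible reading of the description, not the answerer's own code: the SplitSketch class and Normalize method are hypothetical names, and the sketch relies on the stated assumption that a decimal separator is always present.

```csharp
using System;
using System.Linq;

public static class SplitSketch
{
    // Hypothetical helper: treats the last chunk as the fractional part
    // and everything before it as the whole part.
    // Assumption: the input always contains a decimal separator.
    public static string Normalize(string number)
    {
        string[] parts = number.Split(',', '.');

        // No separator at all: nothing to reinterpret
        if (parts.Length == 1)
            return number;

        // Join every chunk except the last into the whole part
        string whole = string.Concat(parts.Take(parts.Length - 1));

        // The last chunk (possibly empty) is the fractional part
        string fraction = parts[parts.Length - 1];

        return whole + "." + fraction;
    }
}
```

With this sketch, SplitSketch.Normalize("1,234.567,89") gives "1234567.89", and a trailing separator as in "123," gives "123." because the last chunk is empty. As the caveat says, an input like "123,456" would be turned into "123.456", which is wrong if the comma was really a thousands separator.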
Upvotes: 1
Reputation:
Try this Regex:
\b[\.,\d][^\s]*\b
Here \b is a word boundary, [\.,\d] matches a dot, a comma, or a digit, and [^\s]* matches any run of non-whitespace characters, so the match cannot contain spaces.
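To show how such a pattern might be used from C#, here is a small sketch that pulls candidate number strings out of a line of OCR'd text. Note that it uses a narrower pattern than the one above, \d+(?:[.,]\d+)*, which is my own assumption: it only accepts runs of digits joined by single dots or commas. The NumberFinder class and FindCandidates method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberFinder
{
    // Assumed pattern (not the one from the answer above):
    // one or more digits, then any number of ". or , followed by digits" groups.
    private static readonly Regex Candidate = new Regex(@"\d+(?:[.,]\d+)*");

    // Returns every substring of the text that looks like a number
    // with dot or comma separators, in order of appearance.
    public static List<string> FindCandidates(string text)
    {
        var result = new List<string>();
        foreach (Match m in Candidate.Matches(text))
            result.Add(m.Value);
        return result;
    }
}
```

Called on "Total 1.234,567.89 or 1,234,567,89 due.", FindCandidates returns the two strings "1.234,567.89" and "1,234,567,89"; the full stop after "due" is not picked up because the pattern must start with a digit. Deciding which separator is the decimal one is still left to code that runs after the match.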
Upvotes: 1