Reputation: 3101
I am trying to clean up the results of poor-quality OCR reads by removing everything I can safely assume is a mistake.
The desired result is a six-digit numeric string, so I can rule out any character that isn't a digit. I also know these numbers appear sequentially, so any number that is out of sequence is also very likely to be incorrect.
(Yes, fixing the quality would be best but no... they won't/can't change their documents)
I immediately call Trim() to remove white space and, because these values are going to end up as file names, I also remove all illegal characters.
I've worked out which characters are digits and added them to a dictionary keyed by the array position at which they were found. This gives me a clear visual indication of the number sequences, but I am struggling with the logic needed to get my program to recognise them.
Tested with the string "Oct', 2$3622" (an actual bad read). The ideal output for this would be "3622".
public String FindLongest(string OcrText)
{
    try
    {
        Char[] text = OcrText.ToCharArray();
        List<char> numbers = new List<char>();
        Dictionary<int, char> consec = new Dictionary<int, char>();
        for (int a = 0; a < text.Length; a++)
        {
            if (Char.IsDigit(text[a]))
            {
                consec.Add(a, text[a]);
                // Won't allow duplicates?
                //consec.Add(text[a].ToString(), true);
            }
        }
        foreach (var item in consec.Keys)
        {
            #region Idea that didn't work
            // Combine values with consecutive keys into new list
            // With most consecutive?
            for (int i = 0; i < consec.Count; i++)
            {
                // if index key doesn't match loop, value was not consecutive
                // Ah... falsely assuming it will start at 1. Won't work.
                if (item == i)
                    numbers.Add(consec[item]);
                else
                    numbers.Add(Convert.ToChar("#")); //string split value
            }
            #endregion
        }
        return null;
    }
    catch (Exception ex)
    {
        string message;
        if (ex.InnerException != null)
            message =
                "Exception: " + ex.Message +
                "\r\n" +
                "Inner: " + ex.InnerException.Message;
        else
            message = "Exception: " + ex.Message;
        MessageBox.Show(message);
        return null;
    }
}
Upvotes: 1
Views: 2665
Reputation: 329
var split = Regex.Split(OcrText, @"\D+").ToList();
var longest = (from s in split
               orderby s.Length descending
               select s).FirstOrDefault();
I would recommend Regex.Split with the pattern \D+ (@"\D+" in code), which splits the text on every run of characters that are not digits. I would then run a LINQ query over the pieces to find the longest string by .Length.
As you can see, it's both simple and very readable.
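One thing worth knowing: Regex.Split leaves empty strings in the result when the input starts or ends with non-digits. That is harmless here, since an empty string only wins when there are no digits at all. Below is a minimal self-contained sketch of the same idea, using the bad read from the question; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitExample
{
    // Hypothetical helper: returns the longest run of digits, or "" if there are none.
    public static string LongestDigits(string ocrText)
    {
        // Splitting on runs of non-digits leaves only the digit runs,
        // plus empty strings at the edges: "Oct', 2$3622" -> "", "2", "3622".
        return Regex.Split(ocrText, @"\D+")
                    .OrderByDescending(s => s.Length)
                    .First(); // Split always returns at least one element
    }
}
```

Calling `SplitExample.LongestDigits("Oct', 2$3622")` should give "3622".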
Upvotes: 1
Reputation: 38446
Since you strictly want numeric matches, I would suggest using a regex that matches (\d+).
MatchCollection matches = Regex.Matches(input, @"(\d+)");
string longest = string.Empty;
foreach (Match match in matches) {
    if (match.Success) {
        if (match.Value.Length > longest.Length) longest = match.Value;
    }
}
This will give you the longest run of digits. If you would rather compare the actual values (which generally also picks the longest run, and additionally settles ties between same-length matches):
MatchCollection matches = Regex.Matches(input, @"(\d+)");
int biggest = 0;
foreach (Match match in matches) {
    if (match.Success) {
        int current = 0;
        int.TryParse(match.Value, out current);
        if (current > biggest) biggest = current;
    }
}
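Since the question says a good read is exactly six digits, one possible refinement is to prefer a run of that length and only fall back to the longest run otherwise. This is just a sketch under that assumption; the `BestMatch` name and the `expectedLength` parameter are my own inventions, not anything from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OcrNumber
{
    // Hypothetical helper: prefers a digit run of exactly expectedLength,
    // otherwise returns the longest run (first one wins on ties), or "" if none.
    public static string BestMatch(string input, int expectedLength = 6)
    {
        var runs = Regex.Matches(input, @"\d+")
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToList();

        // A run of the expected length is the most likely clean read.
        var exact = runs.FirstOrDefault(r => r.Length == expectedLength);
        if (exact != null) return exact;

        // Fall back to the longest run; staying with strings keeps leading zeros.
        string longest = string.Empty;
        foreach (string run in runs)
        {
            if (run.Length > longest.Length) longest = run;
        }
        return longest;
    }
}
```

Working with strings rather than int.TryParse also avoids losing leading zeros, which matters if the result ends up in a file name.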
Upvotes: 1
Reputation: 6605
So you just need to find the longest sequence of digits? Why not use a regex?
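Regex reg = new Regex(@"\d+");
MatchCollection mc = reg.Matches(input);
string longest = string.Empty;
foreach (Match mt in mc)
{
    // mt.Groups[0].Value.Length is the length of the sequence;
    // keep the longest one seen so far
    if (mt.Groups[0].Value.Length > longest.Length)
        longest = mt.Groups[0].Value;
}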
Just a thought.
Upvotes: 1
Reputation: 116411
A quick and dirty way to get the longest sequence of digits would be by using a Regex like this:
var t = "sfas234sdfsdf55323sdfasdf23";
var longest = Regex.Matches(t, @"\d+").Cast<Match>().OrderByDescending(m => m.Length).First();
Console.WriteLine(longest);
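This actually collects all the digit sequences and then uses LINQ to select the longest of them.
It doesn't do anything special for multiple sequences of the same length: the sort is stable, so the first of the longest ones is returned. Note that First() throws if the input contains no digits at all.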
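If ties matter, one option is to return every sequence that shares the maximum length and let the caller decide. Here is a rough sketch of that; the `AllLongest` name is hypothetical, and it assumes you want the candidates in the order they appear.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TieExample
{
    // Hypothetical helper: returns every digit run that shares the maximum length,
    // in order of appearance; an empty list when the input has no digits.
    public static List<string> AllLongest(string text)
    {
        var runs = Regex.Matches(text, @"\d+")
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToList();
        if (runs.Count == 0) return new List<string>();

        int max = runs.Max(r => r.Length);
        return runs.Where(r => r.Length == max).ToList();
    }
}
```

For the sample string above this returns a single entry, "55323"; for something like "ab123cd456e7" it returns both "123" and "456".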
Upvotes: 5