Reputation: 33
I have a table that contains human entered observation data. There is a column that is supposed to correspond to another list; the human entered value should identically match that in a sort of master list of possibilities.
The problem however is that the human data is abbreviated, misspelled, and etc. Is there a mechanism that does some sort of similarity search to find what the human entered data should actually be?
Examples
**Human Entered** **Should Be**
Carbon-12 Carbon(12)
South Korea Republic of Korea
farenheit Fahrenheit
The only thought I really have is to break up the Human Entered data into like 3 character sections and see if they are contained in the Should Be list. It would just pick the highest rated entry. As a later addition it could present the user with a choice of the top 10 or something.
I'm also not necessarily interested in an absolutely perfect solution, but if it worked like 70% right it would save A LOT of time going through the list.
Upvotes: 3
Views: 324
Reputation:
You can try to calculate the similarity of two strings using Levenshtein distance:
private static int CalcLevensteinDistance(string left, string right)
{
if (left == right)
return 0;
int[,] matrix = new int[left.Length + 1, right.Length + 1];
for (int i = 0; i <= left.Length; i++)
// delete
matrix[i, 0] = i;
for (int j = 0; j <= right.Length; j++)
// insert
matrix[0, j] = j;
for (int i = 0; i < left.Length; i++)
{
for (int j = 0; j < right.Length; j++)
{
if (left[i] == right[j])
matrix[i + 1, j + 1] = matrix[i, j];
else
{
// deletion or insertion
matrix[i + 1, j + 1] = System.Math.Min(matrix[i, j + 1] + 1, matrix[i + 1, j] + 1);
// substitution
matrix[i + 1, j + 1] = System.Math.Min(matrix[i + 1, j + 1], matrix[i, j] + 1);
}
}
}
return matrix[left.Length, right.Length];
}
Now calculate the similarity between two strings in %
public static double CalcSimilarity(string left, string right, bool ignoreCase)
{
if (ignoreCase)
{
left = left.ToLower();
right = right.ToLower();
}
double distance = CalcLevensteinDistance(left, right);
if (distance == 0.0f)
return 1.0f;
double longestStringSize = System.Math.Max(left.Length, right.Length);
double percent = distance / longestStringSize;
return 1.0f - percent;
}
Upvotes: 2
Reputation: 5479
Have you considered using a (...or several) drop down list(s) to enforce correct input? In my opinion, that would be a better approach in most cases when considering usability and user friendlyness. It would also make treatment of this input a lot easier. When just using free text input, you'd probably get a lot of different ways to write one thing, and you'll "never" be able to figure out every way of writing anything complex.
Example: As you wrote; "carbon-12", "Carbon 12", "Carbon ( 12 )", "Carbon (12)", "Carbon - 12" etc... Just for this, the possibilities are nearly endless. When you also consider things like "South Korea" vs "Republic of Korea" where the mapping is not "1:1" (What about North Korea? Or just "Korea"?), this gets even harder.
Of course, I know nothing about your application and might be completely wrong. But usually, when you expect complex values in a certain format, a drop down list would in many cases make both your job as a developer easier, as well as give the end user a better experience.
Upvotes: 1
Reputation: 838376
One option is to look for a small Levenshtein distance between two strings rather than requiring an exact match. This would help find matches where there are minor spelling differences or typos.
Another option is to normalize the strings before comparing them. The normalization techniques that make sense depend on your specific application but it could for example involve:
You can then compare the normalized forms of the members of each list instead of the original forms. You may also want to consider using a case-insensitive comparison instead of a case-sensitive comparison.
Upvotes: 3