Reputation: 1335
I am currently working on a problem of finding the best matches for the data in a list named "ListA" from another list called "ListB". Whenever an element of "ListA" matches an element of "ListB" with a similarity of 70% or greater, I add the matched string from ListB and the string from ListA to a tuple, which I then save to a database.
The Levenshtein algorithm gives me a distance, which I turn into a similarity percentage and compare against my threshold of 70; if the value is equal to or greater than the 70 percent threshold, I pair the matched string with the original element of "ListA".
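For concreteness, with the classic "kitten" / "sitting" pair (Levenshtein distance 3) that check works out like this:
int leven = 3;                                              // LevenshteinDistance("kitten", "sitting")
int length = Math.Max("kitten".Length, "sitting".Length);   // 7
double similarity = 1.0 - (double)leven / length;           // ≈ 0.571
bool isMatch = similarity * 100 >= 70;                      // false, so this pair is not kept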
The code I have written for this process works fine when the records in "ListA" and "ListB" are in the thousands, but once I increase the record count to a million it takes about an hour to calculate the distances for the elements of ListA.
I need to optimize the process for huge data sets. Please advise where I need to make improvements.
My code for the process so far looks like this:
// Requires: using System.Collections.Concurrent; using System.Threading.Tasks;
public static async Task PerformFuzzyMatch()
{
    // Fetch the ListA row count and ListB from the SQL tables
    var ListACount = await FuzzyMatchRepository.FetchListACount();
    var ListB = await FuzzyMatchRepository.FetchListBAsync();

    // Collects exceptions thrown inside the parallel loop
    var exceptions = new ConcurrentQueue<Exception>();

    // Split the ListA data into smaller chunks and loop through those chunks
    var splitGroupSize = 1000;
    // Round up so the last, partially filled batch is not dropped
    var sourceDataBatchesCount = (ListACount + splitGroupSize - 1) / splitGroupSize;

    // Loop through the smaller chunks of ListA
    for (int b = 0; b < sourceDataBatchesCount; b++)
    {
        // Thread-safe collection because it is filled from within Parallel.For
        var currentBatchMatchedWords = new ConcurrentBag<Tuple<string, string, double>>();
        int skipRowCount = b * splitGroupSize;
        int takeRowCount = splitGroupSize;

        // Get a chunk of ListA according to skipRowCount and takeRowCount
        var currentSourceDataBatch = FuzzyMatchRepository.FetchSourceDataBatch(skipRowCount, takeRowCount);

        // Loop through ListB and calculate the distance between its elements
        // and the current chunk of ListA in parallel
        for (int i = 0; i < ListB.Count; i++)
        {
            Parallel.For(
                0,
                currentSourceDataBatch.Count,
                new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 },
                cntr =>
                {
                    try
                    {
                        // Call the Levenshtein algorithm to calculate the distance between
                        // the current ListB element and an element of the ListA chunk
                        int leven = LevenshteinDistance(currentSourceDataBatch[cntr], ListB[i]);
                        int length = Math.Max(currentSourceDataBatch[cntr].Length, ListB[i].Length);
                        double similarity = 1.0 - (double)leven / length;
                        if ((similarity * 100) >= 70)
                        {
                            currentBatchMatchedWords.Add(Tuple.Create(currentSourceDataBatch[cntr], ListB[i], similarity));
                        }
                    }
                    catch (Exception ex)
                    {
                        exceptions.Enqueue(ex);
                    }
                });
        }

        // ... save currentBatchMatchedWords to the database here
    }
}
And the algorithm it calls to calculate the distance is:
public static int LevenshteinDistance(this string input, string comparedTo, bool caseSensitive = false)
{
    if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(comparedTo))
    {
        return -1;
    }
    if (!caseSensitive)
    {
        input = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(input);
        comparedTo = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(comparedTo);
    }

    int inputLen = input.Length;
    int comparedToLen = comparedTo.Length;
    // (inputLen + 1) x (comparedToLen + 1) so the full length of both strings is covered
    int[,] matrix = new int[inputLen + 1, comparedToLen + 1];

    // initialize
    for (var i = 0; i <= inputLen; i++)
    {
        matrix[i, 0] = i;
    }
    for (var i = 0; i <= comparedToLen; i++)
    {
        matrix[0, i] = i;
    }

    // analyze
    for (var i = 1; i <= inputLen; i++)
    {
        ushort si = input[i - 1];
        for (var j = 1; j <= comparedToLen; j++)
        {
            ushort tj = comparedTo[j - 1];
            int cost = (si == tj) ? 0 : 1;
            int above = matrix[i - 1, j];
            int left = matrix[i, j - 1];
            int diag = matrix[i - 1, j - 1];
            int cell = FindMinimumOptimized(above + 1, left + 1, diag + cost);

            // transposition (Damerau extension)
            if (i > 1 && j > 1)
            {
                int trans = matrix[i - 2, j - 2] + 1;
                if (input[i - 2] != comparedTo[j - 1])
                {
                    trans++;
                }
                if (input[i - 1] != comparedTo[j - 2])
                {
                    trans++;
                }
                if (cell > trans)
                {
                    cell = trans;
                }
            }
            matrix[i, j] = cell;
        }
    }
    return matrix[inputLen, comparedToLen];
}
Implementation of FindMinimumOptimized:
public static int FindMinimumOptimized(int a, int b, int c)
{
    return Math.Min(a, Math.Min(b, c));
}
Upvotes: 1
Views: 1860
Reputation: 171178
This is a computation that inherently has quadratic cost O(N^2). It is always going to scale very badly as the number of items increases.
You can parallelize this but that's just a constant factor gain in the time it takes to execute.
Can you find some other criterion to match against that is based on equality? In that case you could use a hash-based algorithm to very quickly create candidates for checking. For example, assuming that you are matching text articles and you expect almost all Levenshtein matches to occur in articles written on the same calendar day, you could match by date first (at O(N) cost) and only then quadratically compare all items of that day. You could also compare items from the previous and next day to allow for some slack.
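A minimal sketch of that bucketing idea, assuming a hypothetical Article type with PublishedOn and Text properties (names not taken from the question) and the IsLevenshteinMatch predicate used below:
// Bucket listB by calendar day in O(N); the quadratic Levenshtein comparison
// then only runs inside each day's bucket.
var listBByDay = listB.ToLookup(b => b.PublishedOn.Date);

var matches =
    listA
        .AsParallel()
        .SelectMany(a => listBByDay[a.PublishedOn.Date]
            .Where(b => IsLevenshteinMatch(a.Text, b.Text))
            .Select(b => Tuple.Create(a, b)))
        .ToList();
Allowing for the previous and next day just means also enumerating those two neighboring buckets.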
If you can't do that you have to accept quadratic scaling.
A good code pattern is this:
var pairs = (from a in listA
             from b in listB   // cross product
             select new { a, b });

var matches =
    pairs
        .AsParallel()
        .Where(pair => IsLevenshteinMatch(pair.a, pair.b))
        .ToList();
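IsLevenshteinMatch is not spelled out here; one possible implementation, reusing the question's LevenshteinDistance extension and its 70% threshold (an assumption on my part, not part of the pattern itself):
static bool IsLevenshteinMatch(string a, string b, double threshold = 0.70)
{
    int distance = a.LevenshteinDistance(b);
    if (distance < 0)
    {
        return false; // LevenshteinDistance returns -1 for null or whitespace input
    }
    int length = Math.Max(a.Length, b.Length);
    double similarity = 1.0 - (double)distance / length;
    return similarity >= threshold;
}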
You can throw out all of that complex code you have written for this. When working with concurrency or parallelism it often pays to ponder the best design a bit. Often, there are highly compact solutions for common problems.
Upvotes: 2