Reputation: 1335
I am currently working on a problem of finding the best matches for the data in a list named "ListA" from another list called "ListB". Whenever an element of "ListA" matches an element of "ListB" with a similarity of 70% or greater, I add the matched string from ListB and the string from ListA to a tuple, which I then save to a database.
The Levenshtein algorithm gives me a distance, which I turn into a similarity percentage and compare against my threshold of 70; if the value is equal to or greater than the 70 percent threshold, I pair the matched string with the original element of "ListA".
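For concreteness, with the classic "kitten" / "sitting" pair (Levenshtein distance 3) that check works out like this:
int leven = 3;                                              // LevenshteinDistance("kitten", "sitting")
int length = Math.Max("kitten".Length, "sitting".Length);   // 7
double similarity = 1.0 - (double)leven / length;           // ≈ 0.571
bool isMatch = similarity * 100 >= 70;                      // false, so this pair is not kept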
The code I have written for this process works fine when the records in "ListA" and "ListB" are in the thousands, but once I increase the record count to a million it takes about an hour to calculate the distances for the elements of ListA.
I need to optimize the process for huge data sets. Please advise where I need to make improvements.
My code for the process so far looks like this:
// Requires: using System.Collections.Concurrent; using System.Threading.Tasks;
public static async Task PerformFuzzyMatch()
{
    // Fetch the ListA row count and ListB from the SQL tables
    var ListACount = await FuzzyMatchRepository.FetchListACount();
    var ListB = await FuzzyMatchRepository.FetchListBAsync();

    // Collects exceptions thrown inside the parallel loop
    var exceptions = new ConcurrentQueue<Exception>();

    // Split the ListA data into smaller chunks and loop through those chunks
    var splitGroupSize = 1000;
    // Round up so the last, partially filled batch is not dropped
    var sourceDataBatchesCount = (ListACount + splitGroupSize - 1) / splitGroupSize;

    // Loop through the smaller chunks of ListA
    for (int b = 0; b < sourceDataBatchesCount; b++)
    {
        // Thread-safe collection because it is filled from within Parallel.For
        var currentBatchMatchedWords = new ConcurrentBag<Tuple<string, string, double>>();
        int skipRowCount = b * splitGroupSize;
        int takeRowCount = splitGroupSize;

        // Get a chunk of ListA according to skipRowCount and takeRowCount
        var currentSourceDataBatch = FuzzyMatchRepository.FetchSourceDataBatch(skipRowCount, takeRowCount);

        // Loop through ListB and calculate the distance between its elements
        // and the current chunk of ListA in parallel
        for (int i = 0; i < ListB.Count; i++)
        {
            Parallel.For(
                0,
                currentSourceDataBatch.Count,
                new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 },
                cntr =>
                {
                    try
                    {
                        // Call the Levenshtein algorithm to calculate the distance between
                        // the current ListB element and an element of the ListA chunk
                        int leven = LevenshteinDistance(currentSourceDataBatch[cntr], ListB[i]);
                        int length = Math.Max(currentSourceDataBatch[cntr].Length, ListB[i].Length);
                        double similarity = 1.0 - (double)leven / length;
                        if ((similarity * 100) >= 70)
                        {
                            currentBatchMatchedWords.Add(Tuple.Create(currentSourceDataBatch[cntr], ListB[i], similarity));
                        }
                    }
                    catch (Exception ex)
                    {
                        exceptions.Enqueue(ex);
                    }
                });
        }

        // ... save currentBatchMatchedWords to the database here
    }
}
And the algorithm it calls to calculate the distance is:
public static int LevenshteinDistance(this string input, string comparedTo, bool caseSensitive = false)
{
    if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(comparedTo))
    {
        return -1;
    }
    if (!caseSensitive)
    {
        input = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(input);
        comparedTo = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(comparedTo);
    }

    int inputLen = input.Length;
    int comparedToLen = comparedTo.Length;
    // (inputLen + 1) x (comparedToLen + 1) so the full length of both strings is covered
    int[,] matrix = new int[inputLen + 1, comparedToLen + 1];

    // initialize
    for (var i = 0; i <= inputLen; i++)
    {
        matrix[i, 0] = i;
    }
    for (var i = 0; i <= comparedToLen; i++)
    {
        matrix[0, i] = i;
    }

    // analyze
    for (var i = 1; i <= inputLen; i++)
    {
        ushort si = input[i - 1];
        for (var j = 1; j <= comparedToLen; j++)
        {
            ushort tj = comparedTo[j - 1];
            int cost = (si == tj) ? 0 : 1;
            int above = matrix[i - 1, j];
            int left = matrix[i, j - 1];
            int diag = matrix[i - 1, j - 1];
            int cell = FindMinimumOptimized(above + 1, left + 1, diag + cost);

            // transposition (Damerau extension)
            if (i > 1 && j > 1)
            {
                int trans = matrix[i - 2, j - 2] + 1;
                if (input[i - 2] != comparedTo[j - 1])
                {
                    trans++;
                }
                if (input[i - 1] != comparedTo[j - 2])
                {
                    trans++;
                }
                if (cell > trans)
                {
                    cell = trans;
                }
            }
            matrix[i, j] = cell;
        }
    }
    return matrix[inputLen, comparedToLen];
}
Implementation of FindMinimumOptimized:
public static int FindMinimumOptimized(int a, int b, int c)
{
    return Math.Min(a, Math.Min(b, c));
}
Upvotes: 1
Views: 1860
Reputation: 171178
This is a computation that inherently has quadratic cost O(N^2). It is always going to scale very badly as the number of items increases.
You can parallelize this but that's just a constant factor gain in the time it takes to execute.
Can you find some other criterion to match against that is based on equality? In that case you could use a hash-based algorithm to very quickly create candidates for checking. For example, assuming that you are matching text articles and you expect almost all Levenshtein matches to occur in articles written on the same calendar day, you could match by date first (at O(N) cost) and only then quadratically compare all items of that day. You could also compare items from the previous and next day to allow for some slack.
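A minimal sketch of that bucketing idea, assuming a hypothetical Article type with PublishedOn and Text properties (names not taken from the question) and the IsLevenshteinMatch predicate used below:
// Bucket listB by calendar day in O(N); the quadratic Levenshtein comparison
// then only runs inside each day's bucket.
var listBByDay = listB.ToLookup(b => b.PublishedOn.Date);

var matches =
    listA
        .AsParallel()
        .SelectMany(a => listBByDay[a.PublishedOn.Date]
            .Where(b => IsLevenshteinMatch(a.Text, b.Text))
            .Select(b => Tuple.Create(a, b)))
        .ToList();
Allowing for the previous and next day just means also enumerating those two neighboring buckets.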
If you can't do that you have to accept quadratic scaling.
A good code pattern is this:
var pairs = (from a in listA
             from b in listB   // cross product
             select new { a, b });

var matches =
    pairs
        .AsParallel()
        .Where(pair => IsLevenshteinMatch(pair.a, pair.b))
        .ToList();
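IsLevenshteinMatch is not spelled out here; one possible implementation, reusing the question's LevenshteinDistance extension and its 70% threshold (an assumption on my part, not part of the pattern itself):
static bool IsLevenshteinMatch(string a, string b, double threshold = 0.70)
{
    int distance = a.LevenshteinDistance(b);
    if (distance < 0)
    {
        return false; // LevenshteinDistance returns -1 for null or whitespace input
    }
    int length = Math.Max(a.Length, b.Length);
    double similarity = 1.0 - (double)distance / length;
    return similarity >= threshold;
}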
You can throw out all of that complex code you have written for this. When working with concurrency or parallelism it often pays to ponder the best design a bit. Often, there are highly compact solutions for common problems.
Upvotes: 2