gourgan
gourgan

Reputation: 131

Levenshtein Distance on two files taking too much time

I am new to programming, and I am building a file similarity finder, which finds out how similar two files are. So far, I am storing the files as two strings and then using levenshtein distance for finding out how similar the files are.

The problem is, the execution time without levenshtein distance is 206ms, which is due to file to string conversion. When I use the levenshtein distance, execution time is a whopping 19504ms

Nearly 95 times the time taken to convert a file to a string, which makes this a bottleneck in my project

Any help would be appreciated I am comfortable in C, C++, and Python. If you can point me to any source, I would be grateful

Here is the C++ code for the function I am using for calculating Levenshtein distance:

//LEVENSHTEIN
int levenshtein(std::string a, std::string b){
  int len_a = a.length();
  int len_b = b.length();
  int d[len_a + 1][len_b+1];

  for(int i = 0; i < len_a + 1; i++)
    d[i][0] = i;

  for(int j = 0; j < len_b + 1; j++)
    d[0][j] = j;

  for(int i = 1; i < len_a + 1; i++){
    for(int j = 1; j < len_b + 1; j++){
      if(a[i - 1] == b[j - 1]){
        d[i][j] = d[i - 1][j - 1];
      }
      else{
        d[i][j] = 1 + min(min(d[i][j-1],d[i-1][j]),d[i-1][j-1]);
      }
    }
  }

  int answer = d[len_a][len_b];

  return answer;
}

I have to compare just two files, and not more. I read about the usage of trie in levenshtein, but that is useful for comparing multiple strings to the source. Apart from that, I haven't had much luck

Upvotes: 1

Views: 1159

Answers (2)

A M
A M

Reputation: 15277

I will show you a C++ solution. The language used is C++17. Compiler is MS Visual Studio Community 2019. Compiled in Release mode with all optimizations on.

I created two files with 1000 words each with an "Lorem ipsum sum" generator. The file size for each file is ~6kB.

The result is available in a blink of an eye.

I am using a slighty modified levensthein function and do also use more readable variable names. I do not use a VLA (Variable Length Array), because this is not valid in C++. I use a std::vector instead, which has even superior functionality.

In main, we can see the driver code. First, we open the 2 input files, and check, if they could be opened. If not, we show an error message and quit the program.

Then we read the 2 text files into 2 strings, by using the std::string range constructor and the std::istreambuf_iterator. I do not know any simpler way for reading a complete text file into a std::string.

Then we print the result of the Levensthein distance.

Please see the code below:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator>

// Distance between 2 strings
size_t levensthein(const std::string& string1, const std::string& string2)
{
    // First get the string lengths
    const size_t lengthString1{ string1.size() };
    const size_t lengthString2{ string2.size() };

    // If one of the string length is 0, then return the length of the other
    // This results in 0, if both lengths are 0
    if (lengthString1 == 0) return lengthString2;
    if (lengthString2 == 0) return lengthString1;

    // Initialize substitition cost vector
    std::vector<size_t> substitutionCost(lengthString2 + 1);
    std::iota(substitutionCost.begin(), substitutionCost.end(), 0);

    // Calculate substitution cost
    for (size_t indexString1{}; indexString1 < lengthString1; ++indexString1) {
        substitutionCost[0] = indexString1 + 1;
        size_t corner{ indexString1 };

        for (size_t indexString2{}; indexString2 < lengthString2; ++indexString2) {
            size_t upper{ substitutionCost[indexString2 + 1] };
            if (string1[indexString1] == string2[indexString2]) {
                substitutionCost[indexString2 + 1] = corner;
            }
            else {
                const size_t temp = std::min(upper, corner);
                substitutionCost[indexString2 + 1] = std::min(substitutionCost[indexString2], temp) + 1;
            }
            corner = upper;
        }
    }
    return substitutionCost[lengthString2];
}

// Put in your filenames here
const std::string fileName1{ "text1.txt" };
const std::string fileName2{ "text2.txt" };

int main() {

    // Open first file and check, if it could be opened
    if (std::ifstream file1Stream{ fileName1 }; file1Stream) {

        // Open second file and check, if it could be opened
        if (std::ifstream file2Stream{ fileName2 }; file2Stream) {

            // Both files are open now, read them into strings
            std::string stringFile1(std::istreambuf_iterator<char>(file1Stream), {});
            std::string stringFile2(std::istreambuf_iterator<char>(file2Stream), {});

            // Show Levenstehin distance on screen
            std::cout << "Levensthein distance is: " << levensthein(stringFile1, stringFile2) << '\n';
        }
        else {
            std::cerr << "\n*** Error. Could not open input file '" << fileName2 << "'\n";
        }
    }
    else {
        std::cerr << "\n*** Error. Could not open input file '" << fileName1 << "'\n";
    }
    return 0;
}

Upvotes: 1

Balaji Ambresh
Balaji Ambresh

Reputation: 5037

There's a package called nltk. Check it out.

from nltk import distance
print(distance.edit_distance('aa', 'ab'))

Output:

1

Upvotes: 0

Related Questions