Reputation: 131
I am new to programming, and I am building a file similarity finder, which finds out how similar two files are. So far, I am storing the files as two strings and then using levenshtein distance for finding out how similar the files are.
The problem is, the execution time without levenshtein distance is 206ms, which is due to file to string conversion. When I use the levenshtein distance, execution time is a whopping 19504ms
Nearly 95 times the time taken to convert a file to a string, which makes this a bottleneck in my project
Any help would be appreciated I am comfortable in C, C++, and Python. If you can point me to any source, I would be grateful
Here is the C++ code for the function I am using for calculating Levenshtein distance:
//LEVENSHTEIN
int levenshtein(std::string a, std::string b){
int len_a = a.length();
int len_b = b.length();
int d[len_a + 1][len_b+1];
for(int i = 0; i < len_a + 1; i++)
d[i][0] = i;
for(int j = 0; j < len_b + 1; j++)
d[0][j] = j;
for(int i = 1; i < len_a + 1; i++){
for(int j = 1; j < len_b + 1; j++){
if(a[i - 1] == b[j - 1]){
d[i][j] = d[i - 1][j - 1];
}
else{
d[i][j] = 1 + min(min(d[i][j-1],d[i-1][j]),d[i-1][j-1]);
}
}
}
int answer = d[len_a][len_b];
return answer;
}
I have to compare just two files, and not more. I read about the usage of trie in levenshtein, but that is useful for comparing multiple strings to the source. Apart from that, I haven't had much luck
Upvotes: 1
Views: 1159
Reputation: 15277
I will show you a C++ solution. The language used is C++17. Compiler is MS Visual Studio Community 2019. Compiled in Release mode with all optimizations on.
I created two files with 1000 words each with an "Lorem ipsum sum" generator. The file size for each file is ~6kB.
The result is available in a blink of an eye.
I am using a slighty modified levensthein function and do also use more readable variable names. I do not use a VLA (Variable Length Array), because this is not valid in C++. I use a std::vector
instead, which has even superior functionality.
In main, we can see the driver code. First, we open the 2 input files, and check, if they could be opened. If not, we show an error message and quit the program.
Then we read the 2 text files into 2 strings, by using the std::string
range constructor and the std::istreambuf_iterator
. I do not know any simpler way for reading a complete text file into a std::string
.
Then we print the result of the Levensthein distance.
Please see the code below:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator>
// Distance between 2 strings
size_t levensthein(const std::string& string1, const std::string& string2)
{
// First get the string lengths
const size_t lengthString1{ string1.size() };
const size_t lengthString2{ string2.size() };
// If one of the string length is 0, then return the length of the other
// This results in 0, if both lengths are 0
if (lengthString1 == 0) return lengthString2;
if (lengthString2 == 0) return lengthString1;
// Initialize substitition cost vector
std::vector<size_t> substitutionCost(lengthString2 + 1);
std::iota(substitutionCost.begin(), substitutionCost.end(), 0);
// Calculate substitution cost
for (size_t indexString1{}; indexString1 < lengthString1; ++indexString1) {
substitutionCost[0] = indexString1 + 1;
size_t corner{ indexString1 };
for (size_t indexString2{}; indexString2 < lengthString2; ++indexString2) {
size_t upper{ substitutionCost[indexString2 + 1] };
if (string1[indexString1] == string2[indexString2]) {
substitutionCost[indexString2 + 1] = corner;
}
else {
const size_t temp = std::min(upper, corner);
substitutionCost[indexString2 + 1] = std::min(substitutionCost[indexString2], temp) + 1;
}
corner = upper;
}
}
return substitutionCost[lengthString2];
}
// Put in your filenames here
const std::string fileName1{ "text1.txt" };
const std::string fileName2{ "text2.txt" };
int main() {
// Open first file and check, if it could be opened
if (std::ifstream file1Stream{ fileName1 }; file1Stream) {
// Open second file and check, if it could be opened
if (std::ifstream file2Stream{ fileName2 }; file2Stream) {
// Both files are open now, read them into strings
std::string stringFile1(std::istreambuf_iterator<char>(file1Stream), {});
std::string stringFile2(std::istreambuf_iterator<char>(file2Stream), {});
// Show Levenstehin distance on screen
std::cout << "Levensthein distance is: " << levensthein(stringFile1, stringFile2) << '\n';
}
else {
std::cerr << "\n*** Error. Could not open input file '" << fileName2 << "'\n";
}
}
else {
std::cerr << "\n*** Error. Could not open input file '" << fileName1 << "'\n";
}
return 0;
}
Upvotes: 1
Reputation: 5037
There's a package called nltk. Check it out.
from nltk import distance
print(distance.edit_distance('aa', 'ab'))
Output:
1
Upvotes: 0