Reputation: 237
I'm making a plagiarism detection program. The finished product will compare two entire text documents sentence by sentence; at this point I'm just testing my algorithm for comparing two sentences and giving a number between 0 and 1 for how similar their words are.
I'll try to step through the code and show you what the problem is.
Directives and a function declaration:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <math.h>
#include <set>
double sntnc_cmpr_qtnt(const std::vector<std::string>&, const std::vector<std::string>&);
The main function takes two arrays of strings and puts them into vectors. I know this seems useless, but it's just for my testing purposes. Then I calculate the sentence comparison quotient between the two vectors of strings (which are supposed to be two sentences).
int main (int argc, char* const argv[]) {
    std::string arr1[] = {"Yo", "dawg", "I", "heard", "you", "like", "functions", "so", "we", "put", "a", "function", "inside"};
    std::vector<std::string> str1, str2;
    for (int i = 0; i < sizeof(arr1)/sizeof(std::string); ++i)
        str1.push_back(arr1[i]);

    std::string arr2[] = {"Yo", "dawg", "I", "heard", "you", "like", "cars", "so", "we", "put", "a", "car", "inside"};
    for (int i = 0; i < sizeof(arr2)/sizeof(std::string); ++i)
        str2.push_back(arr2[i]);

    std::cout << sntnc_cmpr_qtnt(str1, str2);
    return 0;
}
Here's the sentence comparison quotient function. It counts the number of words in common between the two sentences.
Something is going wrong, though. My count ("cnt") is coming out to 158, which is obviously too high. I can't figure out why that is.
double sntnc_cmpr_qtnt(const std::vector<std::string>& s1, const std::vector<std::string>& s2) {
    // Place the words of sentences s1 and s2 each into separate sets s1set and s2set:
    std::set<std::string> s1set, s2set;
    for (std::vector<std::string>::const_iterator it = s1.begin(); it != s1.end(); ++it)
        s1set.insert(*it);
    for (std::vector<std::string>::const_iterator it = s2.begin(); it != s2.end(); ++it)
        s2set.insert(*it);

    /* Compute the proportion of words in common between s1set and s2set,
       multiplied by 1 minus 1 over the square root of the size of the smaller set.
       This is the sentence comparison quotient that is returned. */
    double cnt(0.0);
    for (std::set<std::string>::iterator it1 = s1set.begin(); it1 != s1set.end(); ++it1) {
        for (std::set<std::string>::iterator it2 = s2set.begin(); it2 != s2set.end(); ++it2) {
            if ((*it1).compare(*it2))
                cnt += 1.0;
        }
    }

    if (cnt == 0.0) {
        return 0.0;
    } else {
        double minsz = (double)std::min(s1set.size(), s2set.size());
        return ((1-1/sqrt(minsz))*cnt/minsz);
    }
}
Upvotes: 1
Views: 128
Reputation: 241691
You can compare two std::strings using the == operator.
But you're also doing a lot more work than is necessary. A much faster solution would be to put all the words from the first list into a set and save the set's size. Then put all the words from the second list into the same set. The difference between the final set size and the saved size is the number of words in the second list that are not in the first list. (Unique words, that is.) Of course, the saved size is the number of unique words in the first list.
A similar computation would get you the number of unique words in the second list, and the number of words in the first list not in the second list.
The total execution time is roughly proportional to the total number of words (actually, it's n log u, where u is the number of unique words), whereas your solution is proportional to the product of the list sizes.
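A rough sketch of that approach (the function name and the final 0-to-1 score here are illustrative, not the asker's exact formula):

#include <algorithm>
#include <set>
#include <string>
#include <vector>

double overlap_score(const std::vector<std::string>& s1,
                     const std::vector<std::string>& s2) {
    std::set<std::string> words(s1.begin(), s1.end());
    std::size_t unique1 = words.size();           // unique words in s1

    words.insert(s2.begin(), s2.end());
    std::size_t unique_union = words.size();      // unique words in s1 or s2

    std::set<std::string> words2(s2.begin(), s2.end());
    std::size_t unique2 = words2.size();          // unique words in s2

    // Words of s2 that never appeared in s1:
    std::size_t only_in_s2 = unique_union - unique1;
    // So the number of unique words common to both sentences is:
    std::size_t common = unique2 - only_in_s2;

    std::size_t min_unique = std::min(unique1, unique2);
    return min_unique == 0 ? 0.0 : static_cast<double>(common) / min_unique;
}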
Upvotes: 3
Reputation: 10333
The main problem is here:
if ((*it1).compare(*it2))
    cnt += 1.0;
This will increment cnt when the strings are NOT equal, because compare returns 0 for equality.
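A minimal fix for that condition, keeping the rest of the loop unchanged, would be:
if (*it1 == *it2)
    cnt += 1.0;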
Also, since you have a set, instead of doing the inner loop, just call find:
for (std::set<std::string>::iterator it1 = s1set.begin(); it1 != s1set.end(); ++it1)
{
    if (s2set.find(*it1) != s2set.end())
    {
        cnt += 1.0;  // this word from s1 also appears in s2
    }
}
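If you'd rather lean on the standard library, std::set_intersection can also count the overlap, since std::set iterates in sorted order (a sketch, not part of the original answer):

#include <algorithm>
#include <iterator>

std::vector<std::string> common;
std::set_intersection(s1set.begin(), s1set.end(),
                      s2set.begin(), s2set.end(),
                      std::back_inserter(common));
double cnt = static_cast<double>(common.size());  // number of words in both sets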
And I have no idea why cnt should be a double instead of an int.
Upvotes: 1