Reputation: 1579
I have the following situation:
String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
String b = "Web Crawler computer program browses the World Wide Web";
Is there any idea or standard algorithm to calculate the percentage of similarity?
For instance, in the above case, the similarity estimated by eye should be 90% or more.
My idea is to tokenize both Strings and compare the number of tokens matched, something like (7 tokens / 10 tokens) * 100. But, of course, this method is not effective at all. Comparing the number of characters matched also does not seem effective....
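Roughly, what I have in mind is the sketch below (lower-casing the text and splitting on whitespace are just simplifying assumptions on my part):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TokenOverlap {

    // Percentage of tokens from the smaller token set that also occur in the other one.
    // Splitting on whitespace and lower-casing are simplifying assumptions.
    static double similarityPercent(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tokensB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));

        Set<String> smaller = tokensA.size() <= tokensB.size() ? tokensA : tokensB;
        Set<String> larger  = smaller == tokensA ? tokensB : tokensA;

        int matched = 0;
        for (String token : smaller) {
            if (larger.contains(token)) {
                matched++;
            }
        }
        return 100.0 * matched / smaller.size();
    }

    public static void main(String[] args) {
        String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
        String b = "Web Crawler computer program browses the World Wide Web";
        System.out.printf("similarity = %.1f%%%n", similarityPercent(a, b));
    }
}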
Can anyone give some guidelines???
Above is part of my project, Plagiarism Analyzer.
Hence, the words matched will be exactly the same, without any synonyms.
The only question in this case is how to calculate a reasonably accurate percentage of similarity.
Thanks a lot for any help.
Upvotes: 4
Views: 4628
Reputation: 10173
As Konrad pointed out, your question depends heavily on what you mean by "similar". In general, I would say the following guidelines should be of use:
You can refine the technique to include differences between word forms, sentence word order, synonym lists, etc. Although you'll never get perfect results, you have a lot of tweaking possibilities, and I believe that in general you might get quite valuable measures of similarity.
Upvotes: 5
Reputation: 2936
One common measure is the Levenshtein distance, a special case of the string edit distance. It is also included in the Apache Commons string utilities (StringUtils in Commons Lang).
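For example (assuming Commons Lang 3 is on the classpath; newer releases move this into commons-text as LevenshteinDistance):

import org.apache.commons.lang3.StringUtils;

public class LevenshteinExample {
    public static void main(String[] args) {
        String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
        String b = "Web Crawler computer program browses the World Wide Web";

        // Number of single-character insertions, deletions and substitutions
        // needed to turn one string into the other.
        int distance = StringUtils.getLevenshteinDistance(a, b);
        System.out.println("Levenshtein distance: " + distance);
    }
}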
Upvotes: 1
Reputation: 4141
I second what Konrad Rudolph has already said.
Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.
Given what you seem to be looking for, I recommend that you apply some of the standard text-processing methods. All of these have potential pitfalls, so I list them in order of both application and difficulty to do well.
Upvotes: 2
Reputation: 10526
The problem with this question is: the similarity may be either a human-judged similarity (as in your "90%++" estimate) or a statistical similarity (Konrad Rudolph's answer).
The human-judged similarity can never be easily calculated. For instance, consider these three word pairs:
cellphone / mobile
car / automobile
message / post
The statistical similarity is very low, while to a human they are actually quite similar. Thus it will be hard to solve this problem, and the only thing I can point you to is Bayesian filtering or artificial intelligence with Bayesian networks.
Upvotes: 1
Reputation: 121
The longest common subsequence (LCS) is a well-known string dissimilarity metric, typically computed with dynamic programming; see the sketch below.
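A minimal dynamic-programming sketch of the LCS length (the class and method names here are arbitrary):

public class Lcs {

    // Length of the longest common subsequence of a and b,
    // computed with the standard O(n*m) dynamic-programming table.
    static int lcsLength(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[n][m];
    }

    public static void main(String[] args) {
        String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
        String b = "Web Crawler computer program browses the World Wide Web";
        System.out.println("LCS length: " + lcsLength(a, b));
    }
}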
Upvotes: 0
Reputation: 545875
That depends on your idea of similarity. Formally, you need to define a metric of what you consider “similar” strings in order to apply statistics to them. Usually, this is done via the hypothetical question: “how likely is it that one string is a modified version of the other, into which errors (e.g. from typing it) were introduced?”
A very simple yet effective measure of such similarity (or rather, of its inverse) is the edit distance between two strings, which can be computed using dynamic programming and takes O(nm) time in general, where n and m are the lengths of the strings.
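As an illustration only (this particular normalization is an arbitrary choice, not a standard definition), the edit distance can be turned into the percentage you ask for by dividing by the length of the longer string:

public class EditDistanceSimilarity {

    // Classic O(n*m) dynamic-programming edit (Levenshtein) distance.
    static int editDistance(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) dp[i][0] = i;
        for (int j = 0; j <= m; j++) dp[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        }
        return dp[n][m];
    }

    // One possible normalization: 100% means identical, 0% means nothing in common.
    static double similarityPercent(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 100.0;
        return 100.0 * (maxLen - editDistance(a, b)) / maxLen;
    }

    public static void main(String[] args) {
        String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
        String b = "Web Crawler computer program browses the World Wide Web";
        System.out.printf("similarity = %.1f%%%n", similarityPercent(a, b));
    }
}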
Depending on your usage, more elaborate measures (or completely unrelated ones, such as the Soundex metric) might be required.
In your case, if you straightforwardly apply a token match (i.e. mere word count) you will never get a > 90% similarity. To get such a high similarity in a meaningful way would require advanced semantic analysis. If you get this done, please publish the paper, because this is as yet a largely unsolved problem.
Upvotes: 4