Reputation: 313
I want to find a way to compare strings so that s1 and s2 in the following example are treated as the same.
String s1 = "John: would you please one the door";
String s2 = "John: would you please one the door ????";
What should I do?
Upvotes: 0
Views: 109
Reputation: 6705
"Similar" implies that there are commonalities, and measuring them is a nontrivial problem. What you are really asking for is a relevance score and faceted search. This is typically done by tokenizing a string into its base words and checking for the presence of common base words within the result. As a concrete example, take the sentence:
"The shadowy figure fell upon them."
You can break this down into facets:
shadow
figure
fell
Each of these can be evaluated with synonyms:
shadow -> dark, shade, silhouette, etc...
figure -> statistic, number, quantity, amount, level, total, sum, silhouette, outline, shape, form, etc...
fell -> cut down, chop down, hack down, saw down, knock down/over, knock to the ground, strike down, bring down, bring to the ground, prostrate, etc...
The same is then done to the string being compared, and the common facets are counted. The more facets the two strings share, the higher the relevance of the match.
There are lots of fairly heavyweight tools like Lucene and Solr in the open source community that tackle this problem, but you may be able to do a simple version by breaking the string into tokens and simply looking for common tokens. A simple example:
import java.util.HashMap;

public class TokenExample {

    // Split s on whitespace and count how often each lower-cased token occurs.
    public static HashMap<String, Integer> tokenizeString(String s)
    {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        for (String token : s.split("\\s+"))
        {
            // normalize the token
            token = token.toLowerCase();
            if (map.containsKey(token))
            {
                map.put(token, map.get(token) + 1);
            }
            else
            {
                map.put(token, 1);
            }
        }
        return map;
    }

    // Count the distinct tokens that occur in both strings.
    public static Integer getCommonalityCount(String s1, String s2)
    {
        HashMap<String, Integer> map1 = tokenizeString(s1);
        HashMap<String, Integer> map2 = tokenizeString(s2);
        Integer commonIndex = 0;
        for (String token : map1.keySet())
        {
            if (map2.containsKey(token))
            {
                commonIndex += 1;
                // you could instead count how often they match like this
                // commonIndex += map2.get(token) + map1.get(token);
            }
        }
        return commonIndex;
    }

    public static void main(String[] args) {
        String s1 = "John: would you please one the door";
        String s2 = "John: would you please one the door ????";
        String s3 = "John: get to the door and open it please ????";
        String s4 = "John: would you please one the door ????";
        System.out.println("Commonality index: " + getCommonalityCount(s1, s2)); // 7
        System.out.println("Commonality index: " + getCommonalityCount(s3, s4)); // 5
    }
}
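A raw count of common tokens grows with the length of the strings, so it is hard to pick a cutoff for "same". One possible refinement, sketched below, is to divide the number of shared tokens by the number of distinct tokens in both strings (the Jaccard index), which gives a score between 0 and 1. The class and method names here are made up for illustration, and dropping punctuation-only tokens such as "????" is an assumption about what should count as noise.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {

    // Lower-case the input, split on whitespace, strip everything that is not
    // a letter or digit from each token, and drop tokens that end up empty.
    static Set<String> tokens(String s) {
        Set<String> result = new HashSet<>();
        for (String token : s.toLowerCase().trim().split("\\s+")) {
            String cleaned = token.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    // Jaccard index: |intersection| / |union| of the two token sets.
    static double similarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two strings without any tokens are treated as equal
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String s1 = "John: would you please one the door";
        String s2 = "John: would you please one the door ????";
        System.out.println(similarity(s1, s2)); // 1.0
    }
}
```

With this normalization the two strings from the question score 1.0, while s3 and s4 from the example above share 4 of 12 distinct tokens and score about 0.33.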
Upvotes: 1
Reputation: 68847
I'm not aware of any good techniques, but getting rid of repeated spaces and punctuation might be a start.
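String s1 = "John: would you please one the door";
String s2 = "John: would you please one the door ????";
// collapse runs of spaces, drop the listed punctuation characters, then trim
s1 = s1.replaceAll(" {2,}", " ").replaceAll("[.?!/\\()]", "").trim();
s2 = s2.replaceAll(" {2,}", " ").replaceAll("[.?!/\\()]", "").trim();
if (s1.equalsIgnoreCase(s2))
{
    // the strings are considered the same
}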
Demo that works on your example strings: http://ideone.com/FSHOJt
Upvotes: 1
Reputation: 274
There are various approaches to this problem. An easy way to solve it is to use the Levenshtein distance. Another approach is cosine similarity. If you need more details, please comment.
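To make the cosine similarity idea a bit more concrete, here is a minimal sketch that treats each string as a vector of word counts and computes the cosine of the angle between the two vectors. The class and method names are made up for illustration, and splitting on anything that is not a letter or digit is just one assumption about how to tokenize.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {

    // Count lower-cased words; any run of non-letter, non-digit characters is a separator.
    static Map<String, Integer> termCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a count vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either string has no words.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            Integer other = vb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(va);
        double nb = norm(vb);
        if (na == 0 || nb == 0) {
            return 0.0;
        }
        return dot / (na * nb);
    }
}
```

For the two strings in the question the word-count vectors are identical, so the result is 1 up to floating-point rounding. Strings with no words in common give 0.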
Upvotes: -1
Reputation: 39386
The notion of similarity between strings is described using a string metric. A basic example of a string metric is the Levenshtein distance (often referred to as edit distance).
Wikibooks offers a Java implementation of this algorithm: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java
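For reference, a compact version of the usual dynamic-programming formulation might look like the sketch below. This is not the Wikibooks code; the class and method names are made up for illustration. It keeps only two rows of the distance table at a time.

```java
class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,   // insertion
                                 prev[j] + 1),      // deletion
                        prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev; // reuse the arrays for the next row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String s1 = "John: would you please one the door";
        String s2 = "John: would you please one the door ????";
        System.out.println(distance(s1, s2)); // 5
    }
}
```

The strings from the question are 5 edits apart (the trailing " ????"). One option is to treat two strings as the same when the distance is small relative to their length, but the right threshold depends on your data.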
Upvotes: 5