Reputation: 2616
I am developing webservice in Java using REST framework.
I am using MySQL 5.1 database as a backend.
I am performing search operation on one of my table say Stops using like pattern.
But now I want to perform "Approximate_string_matching (fuzzy string searching)" for above search. Consider an e.g. for 23 ST stop, user can provide search string 23rd station, 23rd, 23 station, 23rd ST etc.
For this Approximate_string_matching algorithm I found the link http://en.wikipedia.org/wiki/Approximate_string_matching
But I dont know how to implement it.
Please guys help me to implement Approximate_string_matching algorithm in Java / MySQL?
Thank you in advance.
Upvotes: 1
Views: 3840
Reputation: 22461
As per your explanation it seems that whenever any user provides search string as 23rd station, 23rd, 23 station, or 23rd ST then the filtered output should be "23 ST stop", Right?
So I am assuming that all your stops names will be like XX YY stop where XX is a numerical value and YY is shortform for some station like ST, VT, MT etc
If that is right, then one way you may achieve this is by performing multiple filters such that output of first filter is input to the next filter. But before that you need to figure out "what to filter on"?
So in this particular case, it seems "23" is a must present substring in the start of the query string, so you need to extract the numeric part from your query string (you can use Java regex) apply the result as first filter, so in this case it will be:
where stops like '23%'
then on output of this result you can apply next filter, and that next filter in this case could be the first two letter of the next word (if present) and applying its lower case for consistency, so in this case it would be 'st':
where LOWER(stops) like '%st%'
Now you can achieve this in the query part itself by applying both the filters in the same query (try using subqueries) or you can bring in the resultset of first filter and apply the remaining filter on that resultset using Java regex.
Upvotes: 1
Reputation: 52185
One thing you might want to look into would be the Levenshtein Distance Algorithm:
Levenshtein distance is a string metric for measuring the difference between two sequences.
The Apache Commons Lang has an implementation of this readily available. You could use the getLevenshteinDistance(CharSequence s, CharSequence t, int threshold) to get the strings which are approximately equal to the given string. The threshold would come in handy so that you would be able to discard words which are a certain distance away from your source word, thus avoiding unneeded computation.
A better approach would be to use the Levenshtein function provided by MySQL iteself. A simple example of how to execute can be seen here.
Upvotes: 5