Reputation: 123
I have a column containing text input from various users around the world and I would like to group them so that strings that are similar to each other are closer together. The column contains strings that are in languages other than English and in addition can contain misspelled words.
I have been looking into N-grams and Levenshtein distance but those seem to require that I have a string in mind to match the rows against.
As an example, I would like
|Comcast |
|how to play basketball|
|Walmart |
|www.Comcast.net |
|Wamlart |
|basketball |
to end up looking like:
|Comcast |
|www.Comcast.net |
|how to play basketball|
|basketball |
|Walmart |
|Wamlart |
Any help would be appreciated. Thank you.
Upvotes: 0
Views: 911
Reputation: 44343
If all the strings have things to remove like '.net' before displaying the data, you could pull off something like this:
SELECT column,ordercolumn FROM
(
SELECT column,REPLACE(column,'www.comcast.net','ComCast') ordercolumn
FROM mytable
) A
ORDER BY ordercolumn,column;
Upvotes: 0
Reputation: 6826
Use ORDER BY
instead of GROUP BY
SELECT * FROM table ORDER BY column ASC
SELECT * FROM table ORDER BY column DESC
The column will be ordered alphabetically ascending or descending
Upvotes: 0
Reputation: 171371
You could try ordering by SOUNDEX
code.
E.g.,
ORDER BY SOUNDEX(MyColumn)
This will break down on longer strings, however. You will probably have better luck using the Double Metaphone algorithm:
http://www.atomodo.com/code/double-metaphone/metaphone.sql/view
Upvotes: 1