David542
David542

Reputation: 110163

MySQL query to find similar values in a single column

I have duplicate entries that are highly similar, but not exact. Here are some examples:

- 2016: Obama's America
- 2016: Obama's America (VF)

- Hurt Locker
- The Hurt Locker

What would be a query that I could use to get potentially similar titles ?

Update

Please note that I am not trying to remove EXACT duplicates. I am only trying to select similar values in a single column.

Upvotes: 11

Views: 6853

Answers (2)

sgeddes
sgeddes

Reputation: 62841

Not sure this is the best way or most efficient, and it definitely depends on the meaning of similar. If the meaning is the title contains all of the text in one row but some of the text in another row, then something like this should work:

SELECT DISTINCT T.Title
FROM YourTable T
   LEFT JOIN YourTable T2 ON T.Title != T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
UNION 
SELECT DISTINCT T2.Title
FROM YourTable T
   LEFT JOIN YourTable T2 ON T.Title != T2.Title
WHERE T.Title LIKE CONCAT('%', T2.Title, '%')
ORDER BY Title

And here is the SQL Fiddle.

Upvotes: 3

Bulat
Bulat

Reputation: 6969

I think this can be solved by measuring the distance between strings with some string metric.

Levenshtein seems to be the most well known metric and I have used some implementation of it in Oracle. It is implemented for MySQL also. You might find some other metric that will work better for you.

Upvotes: 4

Related Questions