CLDev
CLDev

Reputation: 1567

Matching strings without space and punctuation in MySQL

I'm working on a query which I thought should be quite intuitive, but somehow I'm facing a bit of issues when implementing it. I guess what I'm trying to achieve is to match a string stored in MySQL DB without space and punctuation (other creative approaches are more than welcome). At the same time I would like the query to handle Unicode characters in diacritics insensitive fashion (so options like REGEXP are kinda out of luck). And the last condition is I'm on MySQL 5.5 with InnoDB engine, so full-text indexing is not supported (but I'm open to upgrade to 5.6/5.7 if it helps sorting this out).

Consider the scenario which the string Hello-World from John Doe is stored in DB. I would like to find it when given the search string HelloWorld or JohnDoe. To be more general, the string in DB can contain brackets, understores and any other punctuation (not limited to ASCII but can compromise for now), while the search string can be a combination of words with or without any separators in between. The closest I've gotten so far is to daisy chain the REPLACE function for a list of known punctuation, like below:

SELECT text FROM table WHERE REPLACE(REPLACE(text, '-', ''), ' ', '') LIKE '%JohnDoe%'

My questions are:

  1. Is there a better way instead of using the daisy chain above?
  2. If that's the only solution, how will the performance be impacted when I chain up hundred or more REPLACE functions?

Thanks in advance for your help.

Upvotes: 0

Views: 5459

Answers (3)

Gianluca Ghettini
Gianluca Ghettini

Reputation: 11628

Method 1

I would have another column on the schema containing an "hashed" version of the name, for example, let's say you have the user:

John Doe The Great

This name hashes to

johndoethegreat

The hash function is coded in such a way that all of the following strings:

John_Doe_THE_great
John Doe The GREAT
John.Doe.The.Great
johnDOE___theGreat
john   Doe   the     great
___john____DOE____THE____great

hash to the same value

johndoethegreat

It's trivial to write such a function. This way you can get the user input, hash it and then compare it against the hash column in your database

Names like:

Jon Doe
John Doo

will not be found of course

Method 2

Use the FULLTEXT search feature built-in in MySQL, sort the results by score and pick the first non zero entry

http://blog.oneiroi.co.uk/mysql/php/mysql-full-text-search-with-percentage-scoring/

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269603

I am totally missing the point of your question. You appear to have the string:

  • Hello-World from John Doe

If you want to find this when the search string is JohnDoe or John Doe, then you only need to substitute spaces:

where replace(text, ' ') like concat('%', 'JohnDoe', '%')

If you want a string that contains both "John" and "Doe" in that order, then:

where replace(text, ' ') like concat('%', 'John%Doe', '%')

I fail to see why 100 nested replace()s would be needed.

Upvotes: 0

user2620460
user2620460

Reputation: 188

I don't know how restrictive your searches must be, but you could try to strip out all non-alphanumeric characters from it, so that you end up with a string like "HelloWorldfromJohnDoe" that you match with instead.

Have a look at this answer: How to remove all non-alpha numeric characters from a string?

You might have to change it around a bit though to make it fir your purposes. I changed it from CHAR(32) to CHAR(255) to make sure I could get the column, but you might want to look into changing the function altogether to fit your data more precisely.

Then you something like this:

SELECT *
FROM testing
WHERE alphanum(test) LIKE CONCAT('%', alphanum('John Doe'), '%')

which should give you a hit.

Upvotes: 4

Related Questions