user1036719
user1036719

Reputation: 1076

Whole word matching with dot characters in MySQL

In MySQL, when searching for a keyword in a text field where only "whole word match" is desired, one could use REGEXP and the [[:<:]] and [[:>:]] word-boundary markers:

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]word[[:>:]]"

For example, when we want to find all text fields containing "europe", using

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]europe[[:>:]]"

would return "europe map", but not "european union".

However, when the target matching words contains "dot characters", like "u.s.", how should I submit a proper query? I tried the following queries but none of them look correct.

1.

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u.s.[[:>:]]"

2.

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[.]s[.][[:>:]]"

3.

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\.s\.[[:>:]]"

When using double backslash to escape special characters, as suggested by d'alar'cop, it returns empty, even though there are something like "u.s. congress" in the table

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.[[:>:]]"

Any suggestion is appreciated!

Upvotes: 7

Views: 12406

Answers (5)

Daniel Gimenez
Daniel Gimenez

Reputation: 20494

The fundamental issue with your predicates is that . is a non-word character, and any non-word character will cause the word boundary test to fail if they follow a start test or precede an end test. You can see the behavior here.

To further complicate the issue, the flavor of regular expressions used by MySQL is very limited. According to Regular-Expressions.info, MySQL uses POSIX-ERE which if you read the chart at the bottom Regular Expression Flavor Comparisons has very few capabilities where compared to other flavors.

To solve your problem you must create a new regular expression that will replace the functionality of the word boundary so that it will allow non-word characters to be part of the boundary. I came up with the follow Regular Expression:

(^|[^[:alnum:]_])YOUR_TEXT_HERE($|[^[:alnum:]_])

This is equivalent to the standard regular expression below:

(^|[^a-zA-Z0-9_])YOUR_TEXT_HERE($|[^a-zA-Z0-9_])

The regex searches for non-words characters or string boundaries at the start and end of the text. (^|[^[:alnum:]_]) matches either start of string, an alpha-numeric character, or an underscore. The ending pattern is similar except it matches the end of a string instead of the start.

The pattern was designed to best match the definition of word boundaries from Regular Expressions in the MySQL manual:

[Boundaries] match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore.

Test Results

Using the regex above, I came up with a scenario where I test a string that contains non-word characters at the start and end - .u.s.. I tried to come up with a reasonable set of test items. You can see the results at SQLFiddle.

Test Data

test string not present:                 'no match' 
missing .'s:                             'no us match' 
missing last .:                          'no u.s match' 
missing first .:                         'no us. match' 
test start boundary word character:      'no.u.s.match'   
test end boundary word character:        'no .u.s.match'   
test boundaries word character:          'no.u.s.match'   
test basic success case:                 'yes .u.s. match' 
test start boundary non-word character:  'yes !.u.s. match'   
test end boundary non-word character:    'yes .u.s.! match'   
test boundaries non-word character:      'yes !.u.s.! match' 
test start of line:                      '.u.s.! yes match'   
test end of line:                        'yes match .u.s.' 

Query

SELECT *
FROM TestRegex
WHERE name REGEXP '(^|[^[:alnum:]_])[.]u[.]s[.]($|[^[:alnum:]_])'; 

SQLFiddle

Conclusion

All the positive cases were returned and none of the negative ones => All test cases succeeded.

  • You can use [.] for the period character instead of \\. which I find to be somewhat more readable in the context of a SQL expression.
  • You can adjust the sets used to define the boundary to be more or less restrictive depending on your desires. For example you can restrict some non-word characters as well: [^a-zA-Z_0-9.!?#$].

Upvotes: 9

Bohemian
Bohemian

Reputation: 424993

This regex does what you want:

SELECT name
FROM tbl_name
WHERE name REGEXP '([[:blank:][:punct:]]|^)u[.]s[.]([[:punct:][:blank:]]|$)'

This matches u.s. when preceeded by:

  • a blank (space, tab etc)
  • punctuation (comma, bracket etc)
  • nothing (ie at start of line)

and followed by:

  • a blank (space, tab etc)
  • punctuation (comma, bracket etc)
  • nothing (ie at end of line)

See an SQLFiddle with edge cases covering above points.

Upvotes: 13

electrofant
electrofant

Reputation: 945

In the mysql regexp manual is a table of special chars and howto escape them.

Doing your query like

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[.]s[.][[:>:]]"

or

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[[.period.]]s[[.period.]][[:>:]]"

will work

Upvotes: -1

asontu
asontu

Reputation: 4659

Working example here: http://www.sqlfiddle.com/#!2/5aa90d/9/0

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.([^[:alnum:]]|$)"

Basically saying that u.s. must be followed by anything that isn't an alphanumeric character, or the end of the string.

You could change [:alnum:] to [:alpha:] to include results like This is u.s.5 if that's desirable.

Upvotes: 1

anubhava
anubhava

Reputation: 785038

Just use this query:

SELECT name FROM tbl_name WHERE name REGEXP ""[[:<:]]u\\.s\\.([[:blank:]]|$)"

No need to use end-of-word [[:>:]] on RHS since you already have a dot after s.

Upvotes: 0

Related Questions