Reputation: 11
I am trying to do search for patterns in MYSQL on some unstructured text fields, based on notes from employees that vary based on different data entry styles. Data entry might record the following for caffeine use:
User 1: 'Caffeine: Never'
User 2: 'Caffeine - Not much'
User 3: 'Caffeine: No'
User 4: 'Caffeine-No'
I am trying to find a way to search all records where it says "Caffeine" + (1 or more space OR no space) + (: OR - OR no character) + (1 or more space OR no space) + (NOT anything starting with 'N')
What I have tried is:
select * from table where text RLIKE [[:space:]][[:punct:]][[:space:]]*[^nN]';
The first part of the expression seems to work, but I'm finding exceptions in the result for the last part, excluding "n" or "N", and I'm not sure why my query isn't excluding. I'm hopeful this is something somebody can help clarify here.
Upvotes: 1
Views: 97
Reputation: 22817
You're still matching these sentences due to backtracking.
Because you're using [^nN]
, it can still match blank
or punct
. So this is what's happening (just using one of your sample inputs to illustrate it):
Using Caffeine[:blank:]*[:punct:]*[:blank:]*[^nN]
as an example (all of your patterns act in the same way).
User 4: 'Caffeine-No'
^^^^^^^^ matches Caffeine literally
^ matches [:blank:] zero times
^ matches [:punct:] one time
^ matches [:blank:] zero times
^ N doesn't match `[^nN]`, let's backtrack to see if something else works
^ matches [:punct:] zero times
^ matches [:blank:] zero times
^ - matches [^nN]
Good match due to backtracking
In the above, your regex allows [^nN]
to match the -
character. Regex wants to match something, so it'll exhaust every possibility until it does (or no possibilities remain and doesn't).
One way around this would be to specify only the possible characters at the position of [^nN]
(something like [0-9a-mo-z]
etc.), but that can get complicated quickly. A better alternative could be as follows:
select * from docs where content REGEXP 'Caffeine[[:blank:]]*[[:punct:]]*[[:blank:]]*[[:<:]][^nN]'
The above line uses [[:<:]]
to assert that the position is the start of a word boundary. Other languages use \b
to indicate the same. This means that it ensures any character except [0-9a-zA-Z_]
matches to the left of the position, and that any character in [0-9a-zA-Z_]
matches to the right of the position.
In other regex engines, the same could be easily achieved with the use of possessive quantifiers (typically the +
following a quantifier, e.g. .*+
), but MySQL doesn't yet have a possessive token (AFAIK).
Upvotes: 1
Reputation: 164679
Rather than trying to guess every possible variation, it's probably simplest to strip "Caffeine" and anything that's not a word character off the front and use the remainder. Use the POSIX [:alnum:]
(alphanumeric) character class and negate it.
select regexp_replace(answer, '^Caffeine[^[:alnum:]]+', '')
from quiz;
Then parse what remains.
Upvotes: 0