Reputation: 1416
My issue is that regexp seems to not be giving the same result given the same pattern in a different context. For example,
str REGEXP 'a|b|c|d|e'
//not equal to
str REGEXP 'a' OR str REGEXP 'b' OR ...
I perform the following query on a large data set where 'content' is text in the form of source code files. I do the inner REGEXP to search for any source files containing my list of keywords. Once I have that full list, I go through it again and check which specific keyword tripped it. This is where there is discrepancy; some files trip when checked against all keywords, but trip nothing when checked against each individual keyword.
select
source_histories.id,
MAX(source_histories.master_event_id) as master_event_id,
source_histories.source_file_id,
source_histories.content REGEXP '[;{}[:space:]]break;' as break,
source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
source_histories,
(SELECT
DISTINCT source_file_id
from
source_histories
where
ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;'
LIMIT 100
) as sourceIdList
where
source_histories.source_file_id = sourceIdList.source_file_id
group by
source_histories.source_file_id;
A section of the results that contains the issue follows. As you can see, source_file_id 92 and 95 did not match any of the keywords when checked individually, but must have matched when checked against all keywords. I have looked through their source codes, and they do contain one or more of the keywords.
id master_event_id source_file_id break break_label continue throw return_void
256 3260 63 1 0 0 1 0
258 3640 65 1 0 0 0 0
259 3640 66 0 0 0 1 0
320 93722 85 1 0 0 0 0
346 471 92 0 0 0 0 0
360 93731 95 0 0 0 0 0
483 96052 108 1 0 0 0 0
536 1010 112 0 0 0 1 0
Does anyone have a suggestion as to what my problem may be? Is this due to a minor over sight, or a nuance of mySQL?
SOLUTION: The issue was in the order I was analyzing my data. I was finding unique source_file_id's that matched my criteria, but there was no guarantee that the corresponding latest version of the file (max master_event_id) had the keyword(s) as-well. the following (however slow) is the solution I have found.
select
source_histories.id,
source_histories.master_event_id as master_event_id,
source_histories.source_file_id,
source_histories.content REGEXP '[;{}[:space:]]break;' as break,
source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
source_histories
inner join
(select
source_histories.id,
MAX(source_histories.master_event_id) as master_event_id,
source_histories.source_file_id
from
source_histories
inner join
(SELECT
DISTINCT source_file_id
FROM
source_histories
LIMIT 100
) as distinctSHList
on
source_histories.source_file_id = distinctSHList.source_file_id
group by
source_file_id
) as lastestSourceList
on source_histories.id = lastestSourceList.id
where
ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;';
Upvotes: 2
Views: 94
Reputation: 97341
The problem is not in the REGEX
clause, but in the way you make your selection.
The subquery will take all source_histories
records for a given source_file_id
into account, while the main query (due to the grouping) will take only one source_histories
record for a given source_file_id
into account.
To verify, remove the GROUP BY
and MAX
clauses from your query and join on the source_histories.id
; the results should match.
select
source_histories.id,
source_histories.source_file_id,
source_histories.content REGEXP '[;{}[:space:]]break;' as break,
source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
source_histories,
(SELECT
id
from
source_histories
where
ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;'
LIMIT 100
) as sourceHistoriesIdList
where
source_histories.id = sourceHistoriesIdList.id
Upvotes: 1