Stewart Smith
Stewart Smith

Reputation: 1416

MySQL REGEXP not behaving as expected

My issue is that regexp seems to not be giving the same result given the same pattern in a different context. For example,

str REGEXP 'a|b|c|d|e'
//not equal to
str REGEXP 'a' OR str REGEXP 'b' OR ...

I perform the following query on a large data set where 'content' is text in the form of source code files. I do the inner REGEXP to search for any source files containing my list of keywords. Once I have that full list, I go through it again and check which specific keyword tripped it. This is where there is discrepancy; some files trip when checked against all keywords, but trip nothing when checked against each individual keyword.

select
    source_histories.id,
    MAX(source_histories.master_event_id) as master_event_id,
    source_histories.source_file_id,
    source_histories.content REGEXP '[;{}[:space:]]break;' as break,
    source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
    source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
    source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
    source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
    source_histories,
    (SELECT 
        DISTINCT source_file_id
    from
        source_histories
    where
        ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;'
    LIMIT 100
    ) as sourceIdList
where
    source_histories.source_file_id = sourceIdList.source_file_id 
group by
    source_histories.source_file_id;

A section of the results that contains the issue follows. As you can see, source_file_id 92 and 95 did not match any of the keywords when checked individually, but must have matched when checked against all keywords. I have looked through their source codes, and they do contain one or more of the keywords.

id  master_event_id  source_file_id  break break_label  continue  throw  return_void 
256 3260             63              1     0            0         1      0
258 3640             65              1     0            0         0      0
259 3640             66              0     0            0         1      0
320 93722            85              1     0            0         0      0
346 471              92              0     0            0         0      0
360 93731            95              0     0            0         0      0
483 96052            108             1     0            0         0      0
536 1010             112             0     0            0         1      0

Does anyone have a suggestion as to what my problem may be? Is this due to a minor over sight, or a nuance of mySQL?

SOLUTION: The issue was in the order I was analyzing my data. I was finding unique source_file_id's that matched my criteria, but there was no guarantee that the corresponding latest version of the file (max master_event_id) had the keyword(s) as-well. the following (however slow) is the solution I have found.

select
    source_histories.id,
    source_histories.master_event_id as master_event_id,
    source_histories.source_file_id,
    source_histories.content REGEXP '[;{}[:space:]]break;' as break,
    source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
    source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
    source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
    source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
    source_histories
    inner join
        (select
            source_histories.id,
            MAX(source_histories.master_event_id) as master_event_id,
            source_histories.source_file_id
        from
            source_histories
            inner join
                (SELECT
                    DISTINCT source_file_id
                FROM
                    source_histories
                LIMIT 100
                ) as distinctSHList
            on
                source_histories.source_file_id = distinctSHList.source_file_id
            group by
                source_file_id
        ) as lastestSourceList
    on source_histories.id = lastestSourceList.id
where
    ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;';

Upvotes: 2

Views: 94

Answers (1)

Robby Cornelissen
Robby Cornelissen

Reputation: 97341

The problem is not in the REGEX clause, but in the way you make your selection.

The subquery will take all source_histories records for a given source_file_id into account, while the main query (due to the grouping) will take only one source_histories record for a given source_file_id into account.

To verify, remove the GROUP BY and MAX clauses from your query and join on the source_histories.id; the results should match.

select
    source_histories.id,
    source_histories.source_file_id,
    source_histories.content REGEXP '[;{}[:space:]]break;' as break,
    source_histories.content REGEXP '[;{}[:space:]]break ' as break_label,
    source_histories.content REGEXP '[;{}[:space:]]continue;' as `continue`,
    source_histories.content REGEXP '[;{}[:space:]]throw ' as throw,
    source_histories.content REGEXP '[;{}[:space:]]return;' as return_void
from
    source_histories,
    (SELECT 
        id
    from
        source_histories
    where
        ifnull(content, '') REGEXP '[;{}[:space:]]break;|[;{}[:space:]]break |[;{}[:space:]]continue;|[;{}[:space:]]throw |[;{}[:space:]]return;'
    LIMIT 100
    ) as sourceHistoriesIdList
where
    source_histories.id = sourceHistoriesIdList.id

Upvotes: 1

Related Questions