pm1359
pm1359

Reputation: 632

Mysql query performance issues while using several REGEXP?

In my query I need to have several regular expression in order to filter the user name or email because I am not interesting on them. So, I have written this query in mysql and after running it , it took so many time to give back me result. I got problem wit performance. Moreover after running this query, even don't filtering my information in the correct way. I am not sure how can I improve my query in order to :

I will be appreciated for any help.

select DISTINCT t.user, vg_product_id,t.`platform`, pd.`mail`,
substring_index(group_concat(p.ts ORDER BY p.ts DESC SEPARATOR ','), ',', 1) as sub_start_ts, 
substring_index(group_concat(t.`expires_at`ORDER BY t.`expires_at` DESC SEPARATOR ','), ',', 1) as expired_time 
from users u
inner join tariff_subs_info t on (t.`user` = u.`user_xmpp_login` 
                                  and t.`user` NOT REGEXP ('^([A-Za-z]{2,3}(produsero|usero)+[0-9]{1,3})$' or '(\w+|\d+)?test(\w+|\d+)?' )
                                  and t.vg_product_id REGEXP "^(europe?|usa?|unlimited?|basic?)([a-zA-Z0-9]+|\_)+(and?|ios?)+$" )
left  join plus_data pd on (u.`user_xmpp_login` = pd.`user`)
inner join purchase_log p on (p.purchase_id = t.purchase_id)
WHERE (pd.mail not like '%guerrillamail.com' 
       or pd.mail is null) 
group by 1,2 ORDER BY DATE(p.ts);

and this is my result:

noadstestuser   basic_XXX_ios   ios NULL    2015-10-26 14:00:32 2015-10-26 14:05:24
brusero2    unlimited_XX_ios    ios [email protected] 2015-11-03 15:41:57 2015-11-03 15:46:45
brusero3    bXX_uscios  ios [email protected] 2015-11-03 15:43:53 2015-11-03 15:48:42
esusero1    unliXX_usc  ios [email protected]   2015-11-03 13:51:54 2015-11-03 13:56:41
esusero3    basic_X_i os    [email protected] 2015-11-03 13:55:08 2015-11-03 14:00:02
esusero4    basic_X ios [email protected] 2015-11-03 14:01:50 2015-11-03 14:06:38
esusero5    unXXXed_us  ios [email protected] 2015-11-03 14:45:38 2015-11-03 14:50:24
esusero6    basic_XX    ios [email protected] 2015-11-03 14:51:22 2015-11-03 14:56:09
esusero7    unlimXX_    ios [email protected] 2015-11-03 15:20:35 2015-11-03 15:25:24
esusero8    basXX_usc   ios [email protected] 2015-11-03 15:22:29 2015-11-03 15:27:14
flusero2    unlXXXe ios [email protected] 2015-11-03 16:57:58 2015-11-03 17:02:45
nlprodusero1    baXicXX_X   ios [email protected] 2015-11-03 14:06:52 2015-11-03 14:11:44
nlprodusero2    unliXXXeds  ios [email protected] 2015-11-03 14:08:28 2015-11-03 14:13:16
prodpurchasetest    baXXc_usXc  ios NULL    2015-11-03 09:20:51 2015-11-03 09:25:41
ukusero1    basicXXsca  ios [email protected] 2015-11-03 15:45:59 2015-11-03 15:48:42
ukusero2    baXXsca ios [email protected]   2015-11-03 17:00:14 2015-11-03 17:05:07
ukusero4    unlXXd_usc  ios [email protected] 2015-11-03 17:02:10 2015-11-03 17:02:45
usprodusero1    uXXited_us  ios [email protected] 2015-11-03 13:30:25 2015-11-03 13:35:14
usprodusero2    bXXXs   ios [email protected] 2015-11-03 13:33:39 2015-11-03 13:38:31
usprodusero5    unlXXsc ios [email protected] 2015-11-03 15:34:35 2015-11-03 15:39:26

This result are unexpected from me and I don't want to have them. Despite of using NOT REGEXP, all these lines came as my results. How can I solve these situation?

After edit:

select t.user, vg_product_id,t.`platform`, pd.`mail`,
substring_index(group_concat(p.ts ORDER BY p.ts DESC SEPARATOR ','), ',', 1) as sub_start_ts, 
substring_index(group_concat(t.`expires_at`ORDER BY t.`expires_at` DESC SEPARATOR ','), ',', 1) as expired_time 
from users u
inner join tariff_subs_info t on (t.`user` = u.`user_xmpp_login` 
                                  and t.`user` NOT REGEXP ('^([A-Za-z]{2,3}(produsero|usero)+[0-9]{1,3})$')
                                  and t.`user` NOT REGEXP ('test')
                                  and t.vg_product_id REGEXP ("^(europe?|usa?|unlimited?|basic?)([a-zA-Z0-9_]+)+(and?|ios?)+$" ))
left  join plus_data pd on (u.`user_xmpp_login` = pd.`user`)
inner join purchase_log p on (p.purchase_id = t.purchase_id)
WHERE (pd.mail not like '%guerrillamail.com' 
      and pd.mail NOT LIKE '%test%'
       or pd.mail is null) 
group by 1,2 ORDER BY DATE(p.ts);

I still have the following result , and 'test' in my user.

noadstestuser   basixxxf_ios    ios NULL    2015-10-26 14:00:32 2015-10-26 14:05:24
prodpurchasetest    basic_uscaxxs   ios NULL    2015-11-03 09:20:51 2015-11-03 09:25:41
esusertest  basic_uscxxxs   ios [email protected]   2015-11-04 13:53:48 2015-11-04 13:58:44
esusertest2 basic_uxxxx ios [email protected]  2015-11-04 14:11:12 2015-11-04 14:13:44

Upvotes: 1

Views: 110

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627302

MySQL REGEXP does not support \w and \d shorthand character classes. So, (\w+|\d+)? subpattern is invalid in MySQL. Since the ? quantifier makes the subpatterns optional (repeat one or zero times), you can remove them altogether.

Thus, or '(\w+|\d+)?test(\w+|\d+)?' will turn into and t.`user` NOT REGEXP ('test'), but it is equal in meaning to and t.`user` NOT LIKE '%test%'.

Next, ([a-zA-Z0-9]+|\_)+ is also very problematic since there are nested quantifiers (a + inside an alternation group that has another + quantifier applied). This is a classical scenario when catastrophical backtracking may occur. I suggest replacing this subpattern with [a-zA-Z0-9_]+ to match letters, digits or an underscore. Or its equivalent [[:alnum:]_]+.

Upvotes: 4

Arth
Arth

Reputation: 13110

OK first off:

t.`user` NOT REGEXP (
  'usero pattern' or
  'test pattern'
) 

is incorrect.. you can't or together two string inputs.. you need:

    t.`user` NOT REGEXP ('usero pattern')
AND t.`user` NOT REGEXP ('test pattern')

You should probably test all the expressions directly on the relevant tables first to make sure the logic is there.. as mentioned by @WiktorStribizew the word and digit character sets may not be recognized.

Next I'm guessing you don't need DISTINCT and GROUP BY in your query, you can probably just drop the DISTINCT.

Unfortunately, your regular expressions are going to hit every row.. there's not a lot you can do about that.

If it's a one off query, you'd probably just have to eat the time. You should probably run an EXPLAIN to check nothing crazy is happening to be sure though.

If you are running the queries regularly I suggest you split out the pertinent information via your application as you save the rows and then index the resultant extra columns, it won't be normalised.. but it's probably a situation where the performance boost will make it worth it.

UPDATE

For example your first two rows could be saved as:

[user:'noadstestuser', is_usero:0, is_test:1],
[user:'brusero2',      is_usero:1, is_test:0]

Then your

    t.`user` NOT REGEXP ('usero pattern')
AND t.`user` NOT REGEXP ('test pattern')

becomes simply

    t.is_usero = 0 AND t.is_test = 0

Much faster.. and you can index these fields if it helps.

Of course you'll have to do the matching before you save each row and this may be easier using your application logic (e.g. PHP code).

Upvotes: 1

Related Questions