Reputation: 632
In my query I need to have several regular expression in order to filter the user name or email because I am not interesting on them. So, I have written this query in mysql and after running it , it took so many time to give back me result. I got problem wit performance. Moreover after running this query, even don't filtering my information in the correct way. I am not sure how can I improve my query in order to :
I will be appreciated for any help.
select DISTINCT t.user, vg_product_id,t.`platform`, pd.`mail`,
substring_index(group_concat(p.ts ORDER BY p.ts DESC SEPARATOR ','), ',', 1) as sub_start_ts,
substring_index(group_concat(t.`expires_at`ORDER BY t.`expires_at` DESC SEPARATOR ','), ',', 1) as expired_time
from users u
inner join tariff_subs_info t on (t.`user` = u.`user_xmpp_login`
and t.`user` NOT REGEXP ('^([A-Za-z]{2,3}(produsero|usero)+[0-9]{1,3})$' or '(\w+|\d+)?test(\w+|\d+)?' )
and t.vg_product_id REGEXP "^(europe?|usa?|unlimited?|basic?)([a-zA-Z0-9]+|\_)+(and?|ios?)+$" )
left join plus_data pd on (u.`user_xmpp_login` = pd.`user`)
inner join purchase_log p on (p.purchase_id = t.purchase_id)
WHERE (pd.mail not like '%guerrillamail.com'
or pd.mail is null)
group by 1,2 ORDER BY DATE(p.ts);
and this is my result:
noadstestuser basic_XXX_ios ios NULL 2015-10-26 14:00:32 2015-10-26 14:05:24
brusero2 unlimited_XX_ios ios [email protected] 2015-11-03 15:41:57 2015-11-03 15:46:45
brusero3 bXX_uscios ios [email protected] 2015-11-03 15:43:53 2015-11-03 15:48:42
esusero1 unliXX_usc ios [email protected] 2015-11-03 13:51:54 2015-11-03 13:56:41
esusero3 basic_X_i os [email protected] 2015-11-03 13:55:08 2015-11-03 14:00:02
esusero4 basic_X ios [email protected] 2015-11-03 14:01:50 2015-11-03 14:06:38
esusero5 unXXXed_us ios [email protected] 2015-11-03 14:45:38 2015-11-03 14:50:24
esusero6 basic_XX ios [email protected] 2015-11-03 14:51:22 2015-11-03 14:56:09
esusero7 unlimXX_ ios [email protected] 2015-11-03 15:20:35 2015-11-03 15:25:24
esusero8 basXX_usc ios [email protected] 2015-11-03 15:22:29 2015-11-03 15:27:14
flusero2 unlXXXe ios [email protected] 2015-11-03 16:57:58 2015-11-03 17:02:45
nlprodusero1 baXicXX_X ios [email protected] 2015-11-03 14:06:52 2015-11-03 14:11:44
nlprodusero2 unliXXXeds ios [email protected] 2015-11-03 14:08:28 2015-11-03 14:13:16
prodpurchasetest baXXc_usXc ios NULL 2015-11-03 09:20:51 2015-11-03 09:25:41
ukusero1 basicXXsca ios [email protected] 2015-11-03 15:45:59 2015-11-03 15:48:42
ukusero2 baXXsca ios [email protected] 2015-11-03 17:00:14 2015-11-03 17:05:07
ukusero4 unlXXd_usc ios [email protected] 2015-11-03 17:02:10 2015-11-03 17:02:45
usprodusero1 uXXited_us ios [email protected] 2015-11-03 13:30:25 2015-11-03 13:35:14
usprodusero2 bXXXs ios [email protected] 2015-11-03 13:33:39 2015-11-03 13:38:31
usprodusero5 unlXXsc ios [email protected] 2015-11-03 15:34:35 2015-11-03 15:39:26
This result are unexpected from me and I don't want to have them. Despite of using NOT REGEXP
, all these lines came as my results. How can I solve these situation?
After edit:
select t.user, vg_product_id,t.`platform`, pd.`mail`,
substring_index(group_concat(p.ts ORDER BY p.ts DESC SEPARATOR ','), ',', 1) as sub_start_ts,
substring_index(group_concat(t.`expires_at`ORDER BY t.`expires_at` DESC SEPARATOR ','), ',', 1) as expired_time
from users u
inner join tariff_subs_info t on (t.`user` = u.`user_xmpp_login`
and t.`user` NOT REGEXP ('^([A-Za-z]{2,3}(produsero|usero)+[0-9]{1,3})$')
and t.`user` NOT REGEXP ('test')
and t.vg_product_id REGEXP ("^(europe?|usa?|unlimited?|basic?)([a-zA-Z0-9_]+)+(and?|ios?)+$" ))
left join plus_data pd on (u.`user_xmpp_login` = pd.`user`)
inner join purchase_log p on (p.purchase_id = t.purchase_id)
WHERE (pd.mail not like '%guerrillamail.com'
and pd.mail NOT LIKE '%test%'
or pd.mail is null)
group by 1,2 ORDER BY DATE(p.ts);
I still have the following result , and 'test' in my user.
noadstestuser basixxxf_ios ios NULL 2015-10-26 14:00:32 2015-10-26 14:05:24
prodpurchasetest basic_uscaxxs ios NULL 2015-11-03 09:20:51 2015-11-03 09:25:41
esusertest basic_uscxxxs ios [email protected] 2015-11-04 13:53:48 2015-11-04 13:58:44
esusertest2 basic_uxxxx ios [email protected] 2015-11-04 14:11:12 2015-11-04 14:13:44
Upvotes: 1
Views: 110
Reputation: 627302
MySQL REGEXP does not support \w
and \d
shorthand character classes. So, (\w+|\d+)?
subpattern is invalid in MySQL. Since the ?
quantifier makes the subpatterns optional (repeat one or zero times), you can remove them altogether.
Thus, or '(\w+|\d+)?test(\w+|\d+)?'
will turn into and t.`user` NOT REGEXP ('test')
, but it is equal in meaning to and t.`user` NOT LIKE '%test%'
.
Next, ([a-zA-Z0-9]+|\_)+
is also very problematic since there are nested quantifiers (a +
inside an alternation group that has another +
quantifier applied). This is a classical scenario when catastrophical backtracking may occur. I suggest replacing this subpattern with [a-zA-Z0-9_]+
to match letters, digits or an underscore. Or its equivalent [[:alnum:]_]+
.
Upvotes: 4
Reputation: 13110
OK first off:
t.`user` NOT REGEXP (
'usero pattern' or
'test pattern'
)
is incorrect.. you can't or
together two string inputs.. you need:
t.`user` NOT REGEXP ('usero pattern')
AND t.`user` NOT REGEXP ('test pattern')
You should probably test all the expressions directly on the relevant tables first to make sure the logic is there.. as mentioned by @WiktorStribizew the word and digit character sets may not be recognized.
Next I'm guessing you don't need DISTINCT and GROUP BY in your query, you can probably just drop the DISTINCT.
Unfortunately, your regular expressions are going to hit every row.. there's not a lot you can do about that.
If it's a one off query, you'd probably just have to eat the time. You should probably run an EXPLAIN to check nothing crazy is happening to be sure though.
If you are running the queries regularly I suggest you split out the pertinent information via your application as you save the rows and then index the resultant extra columns, it won't be normalised.. but it's probably a situation where the performance boost will make it worth it.
UPDATE
For example your first two rows could be saved as:
[user:'noadstestuser', is_usero:0, is_test:1],
[user:'brusero2', is_usero:1, is_test:0]
Then your
t.`user` NOT REGEXP ('usero pattern')
AND t.`user` NOT REGEXP ('test pattern')
becomes simply
t.is_usero = 0 AND t.is_test = 0
Much faster.. and you can index these fields if it helps.
Of course you'll have to do the matching before you save each row and this may be easier using your application logic (e.g. PHP code).
Upvotes: 1