Reputation: 405
Hitting the db (MySQL) with these two queries one right after another I get different results:
test1 = Agreement.objects.filter(pk=152, company__iregex='СитиСтро(и|й)')
test2 = Agreement.objects.filter(pk=152, company__iregex='ситистро(и|й)')
test1 <QuerySet [<Agreement: Agreement object>]>
test2 <QuerySet []>
with the actual value if the field ' "СитиСтрой" '
Now i'm pretty sure that is Cyrillics that is messing things up, because with records in Latin alphabet it works fine, but I have no idea how to go around that (bug?). Any advice here?
PS I did double check, there is no confusion here with similar looking C letters of English and Russian, but with different letter codes.
Update: Checked the sql that Django sends to Mysql.
('SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company`, '
'FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`company` REGEXP '
'СитиСтро(и|й) AND `dbbs_app_agreement`.`id` = 152)')
Seems fine. Tried querying the table directly from phpmyadmin with
SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'С')
which worked, but
SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'с')
at the same time does not.
As @AndreyShipilov below offered, made a new table in the db from scratch with utf8_unicode_ci collation, inserted there the value in question (ООО "СитиСтрой") and tried these two queries from phpmyadmin:
SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'с')
SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'С')
Second one works, first one does not. Really weird.
update2 My initial code that formed the query looked like that:
query_ka_name = reduce(operator.and_,
(Q(company__iregex=r'(([^\w]|^){i}([^\w]|$))'.format(i=re.sub(r'и|й', '(и|й)', item, flags=re.IGNORECASE)))
the purpose of that being to check if a db record corresponded to the array of keywords recognized from a scan as a company name. Since the scanner is really bad with differentiating й from и, and db records are beyond my control I added that little thing to consider these letters as one.
Now the code looks like that:
query_ka_name = reduce(operator.and_, (Q(company__iregex=tambourine(item)) for item in ka_name_listed))
def tambourine(string):
string = re.sub(r'и|й', '(и|й)', string, flags=re.IGNORECASE)
output = ''
for char in string:
if char.isalpha():
output = '{o}({u}|{l})'.format(o=output, u=char.upper(), l=char.lower())
else:
output = '{o}{c}'.format(o=output, c=char)
output = r'(([^\w]|^){i}([^\w]|$))'.format(i=output)
return output
that is probably slow as hell in comparison, but at least it works. Would still greatly appreciate a solution to the problem.
Upvotes: 0
Views: 254
Reputation: 142472
REGEXP
works with bytes not characters. hence only unaccented English letters work in REGEXP
; no Cyrillic letter can (reliably) work.REGEXP
should do a better job. Ref: https://mariadb.com/kb/en/mariadb/pcre/Upvotes: 1
Reputation: 9452
I suggest switching to Postgres database which handles non-latin symbols pretty good.
Just tried to reproduce your issue on my Django 1.10 and Postgres 9.6 setup.
from django.contrib.auth.models import User
users = User.objects.filter(username__iregex='Сосницки(и|й)')
users
<QuerySet [<User: Сосницкий>, <User: сосницкий>, <User: сосницкии>, <User: СоСницкии>]>
Seems to be working.
Upvotes: 0