Reputation: 405

django: iregex is case sensitive

Hitting the db (MySQL) with these two queries one right after another I get different results:

test1 = Agreement.objects.filter(pk=152, company__iregex='СитиСтро(и|й)')
test2 = Agreement.objects.filter(pk=152, company__iregex='ситистро(и|й)')

test1 <QuerySet [<Agreement: Agreement object>]>
test2 <QuerySet []>

with the actual value if the field ' "СитиСтрой" '

Now i'm pretty sure that is Cyrillics that is messing things up, because with records in Latin alphabet it works fine, but I have no idea how to go around that (bug?). Any advice here?

PS I did double check, there is no confusion here with similar looking C letters of English and Russian, but with different letter codes.

Update: Checked the sql that Django sends to Mysql.

('SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company`, '
 'FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`company` REGEXP '
 'СитиСтро(и|й) AND `dbbs_app_agreement`.`id` = 152)')

Seems fine. Tried querying the table directly from phpmyadmin with

SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'С')

which worked, but

SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'с')

at the same time does not.

As @AndreyShipilov below offered, made a new table in the db from scratch with utf8_unicode_ci collation, inserted there the value in question (ООО "СитиСтрой") and tried these two queries from phpmyadmin:

SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'с')
SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'С')

Second one works, first one does not. Really weird.

update2 My initial code that formed the query looked like that:

query_ka_name = reduce(operator.and_,
(Q(company__iregex=r'(([^\w]|^){i}([^\w]|$))'.format(i=re.sub(r'и|й', '(и|й)', item, flags=re.IGNORECASE)))

the purpose of that being to check if a db record corresponded to the array of keywords recognized from a scan as a company name. Since the scanner is really bad with differentiating й from и, and db records are beyond my control I added that little thing to consider these letters as one.

Now the code looks like that:

query_ka_name = reduce(operator.and_, (Q(company__iregex=tambourine(item)) for item in ka_name_listed))

def tambourine(string):
    string = re.sub(r'и|й', '(и|й)', string, flags=re.IGNORECASE)
    output = ''
    for char in string:
        if char.isalpha():
            output = '{o}({u}|{l})'.format(o=output, u=char.upper(), l=char.lower())
        else:
            output = '{o}{c}'.format(o=output, c=char)
    output = r'(([^\w]|^){i}([^\w]|$))'.format(i=output)
    return output

that is probably slow as hell in comparison, but at least it works. Would still greatly appreciate a solution to the problem.

Upvotes: 0

Answers (2)

Rick James

Reputation: 142472

"LATIN SMALL LETTER C is not considered to be the same as "CYRILLIC SMALL LETTER ES".
Ditto for "CYRILLIC SMALL LETTER I" and "CYRILLIC SMALL LETTER SHORT I"
MySQL's REGEXP works with bytes not characters. hence only unaccented English letters work in REGEXP; no Cyrillic letter can (reliably) work.
MariaDB 10.0.5's REGEXP should do a better job. Ref: https://mariadb.com/kb/en/mariadb/pcre/

Upvotes: 1

Nik

Reputation: 9452

I suggest switching to Postgres database which handles non-latin symbols pretty good.

Just tried to reproduce your issue on my Django 1.10 and Postgres 9.6 setup.

from django.contrib.auth.models import User users = User.objects.filter(username__iregex='Сосницки(и|й)') users <QuerySet [<User: Сосницкий>, <User: сосницкий>, <User: сосницкии>, <User: СоСницкии>]>

Seems to be working.

Upvotes: 0

django: iregex is case sensitive

Answers (2)

Related Questions