Vémundr
Vémundr

Reputation: 405

django: iregex is case sensitive

Hitting the db (MySQL) with these two queries one right after another I get different results:

test1 = Agreement.objects.filter(pk=152, company__iregex='СитиСтро(и|й)')
test2 = Agreement.objects.filter(pk=152, company__iregex='ситистро(и|й)')

test1 <QuerySet [<Agreement: Agreement object>]>
test2 <QuerySet []>

with the actual value if the field ' "СитиСтрой" '

Now i'm pretty sure that is Cyrillics that is messing things up, because with records in Latin alphabet it works fine, but I have no idea how to go around that (bug?). Any advice here?

PS I did double check, there is no confusion here with similar looking C letters of English and Russian, but with different letter codes.

Update: Checked the sql that Django sends to Mysql.

('SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company`, '
 'FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`company` REGEXP '
 'СитиСтро(и|й) AND `dbbs_app_agreement`.`id` = 152)')

Seems fine. Tried querying the table directly from phpmyadmin with

SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'С')

which worked, but

SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'с')

at the same time does not.

As @AndreyShipilov below offered, made a new table in the db from scratch with utf8_unicode_ci collation, inserted there the value in question (ООО "СитиСтрой") and tried these two queries from phpmyadmin:

SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'с')
SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'С')

Second one works, first one does not. Really weird.

update2 My initial code that formed the query looked like that:

query_ka_name = reduce(operator.and_,
(Q(company__iregex=r'(([^\w]|^){i}([^\w]|$))'.format(i=re.sub(r'и|й', '(и|й)', item, flags=re.IGNORECASE)))

the purpose of that being to check if a db record corresponded to the array of keywords recognized from a scan as a company name. Since the scanner is really bad with differentiating й from и, and db records are beyond my control I added that little thing to consider these letters as one.

Now the code looks like that:

query_ka_name = reduce(operator.and_, (Q(company__iregex=tambourine(item)) for item in ka_name_listed))

def tambourine(string):
    string = re.sub(r'и|й', '(и|й)', string, flags=re.IGNORECASE)
    output = ''
    for char in string:
        if char.isalpha():
            output = '{o}({u}|{l})'.format(o=output, u=char.upper(), l=char.lower())
        else:
            output = '{o}{c}'.format(o=output, c=char)
    output = r'(([^\w]|^){i}([^\w]|$))'.format(i=output)
    return output

that is probably slow as hell in comparison, but at least it works. Would still greatly appreciate a solution to the problem.

Upvotes: 0

Views: 254

Answers (2)

Rick James
Rick James

Reputation: 142472

  • "LATIN SMALL LETTER C is not considered to be the same as "CYRILLIC SMALL LETTER ES".
  • Ditto for "CYRILLIC SMALL LETTER I" and "CYRILLIC SMALL LETTER SHORT I"
  • MySQL's REGEXP works with bytes not characters. hence only unaccented English letters work in REGEXP; no Cyrillic letter can (reliably) work.
  • MariaDB 10.0.5's REGEXP should do a better job. Ref: https://mariadb.com/kb/en/mariadb/pcre/

Upvotes: 1

Nik
Nik

Reputation: 9452

I suggest switching to Postgres database which handles non-latin symbols pretty good.

Just tried to reproduce your issue on my Django 1.10 and Postgres 9.6 setup.

from django.contrib.auth.models import User users = User.objects.filter(username__iregex='Сосницки(и|й)') users <QuerySet [<User: Сосницкий>, <User: сосницкий>, <User: сосницкии>, <User: СоСницкии>]>

Seems to be working.

Upvotes: 0

Related Questions