Peter Jung

Reputation: 255

Remove diacritics from string for search function

I am developing a simple web page with Django and I need to implement a search function. I am currently using something like this:

search_box = request.GET['search_box']
X = Foo.objects.filter(Q(title__contains=search_box) | Q(info__contains=search_box)).values()

This checks whether the specified columns contain the searched string. But what if I search for "kočík" and my database contains "kocik"? How can I remove diacritics from a string in Python 3, or what is the best way to implement this? Thanks

Upvotes: 1

Views: 928

Answers (1)

mjwunderlich

Reputation: 1035

You can use the unicodedata module from the standard library for that.

import unicodedata

def shave_marks(txt):
    """Remove all diacritic marks from the given string."""
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', shaved)

Some details about this algorithm:

The main problem with diacritics is that in Unicode some are encoded as combining characters that modify the preceding character, while others are precomposed into a single code point together with the base letter. For example, 'café' and 'cafe\u0301' look the same when displayed.
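To make the difference concrete, here is a small sketch comparing the two spellings of "café". The variable names are only illustrative.

```python
import unicodedata

composed = 'caf\u00e9'      # 'é' as a single precomposed code point
decomposed = 'cafe\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

# The strings render identically but are different code point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 4 5

# After normalizing both to the same form they compare equal.
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
```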

From https://docs.python.org/2/library/unicodedata.html:

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.

This algorithm first decomposes the string (using the 'NFD' form) so that decomposable diacritics become separate combining characters, then filters out all combining characters, and finally recomposes the string (using the 'NFC' form).
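As a rough sketch of how this could apply to the search from the question: strip the marks from the query before comparing. The matches helper below is hypothetical and only stands in for the database lookup; it assumes the stored text is either already free of diacritics (as in the "kocik" example) or is stripped the same way.

```python
import unicodedata


def shave_marks(txt):
    """Remove all diacritic marks from the given string."""
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', shaved)


def matches(query, stored):
    """Hypothetical helper: accent-insensitive substring test."""
    return shave_marks(query) in shave_marks(stored)


print(shave_marks('kočík'))            # kocik
print(matches('kočík', 'maly kocik'))  # True

# Letters with no canonical decomposition are left untouched.
print(shave_marks('Łódź'))             # Łodz
```

In the view from the question, the same idea would presumably be search_box = shave_marks(request.GET['search_box']) before building the filter. Note that this only helps if the values stored in the database contain no diacritics; otherwise a separate stripped copy of each searchable column would be needed.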

Upvotes: 6
