Reputation: 255
I am developing a simple web page with Django and I need to implement a search function. I am currently using something like this:
search_box = request.GET['search_box']
X = Foo.objects.filter(Q(title__contains=search_box) | Q(info__contains=search_box)).values()
It checks whether the specified columns in my database contain the search string, but what if I search for "kočík" and my database contains "kocik"? How can I remove diacritics from a string in Python 3, or what is the best way to implement this? Thanks
Upvotes: 1
Views: 928
Reputation: 1035
You can use the unicodedata module from the standard library for that.
import unicodedata

def shave_marks(txt):
    """This method removes all diacritic marks from the given string"""
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', shaved)
Some details about this algorithm:
The main problem with diacritics is that in Unicode some are stored as combining characters that modify the preceding character, while others are precomposed, that is, included in the character itself. For example, 'café' and 'cafe\u0301' look the same when rendered.
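To make the difference concrete, here is a small sketch that builds both spellings of "café" explicitly with escape sequences. The variable names are only illustrative.

```python
import unicodedata

composed = 'caf\u00e9'      # 'é' stored as one precomposed code point
decomposed = 'cafe\u0301'   # plain 'e' followed by COMBINING ACUTE ACCENT

# Same appearance on screen, but different code point sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize('NFC', decomposed) == composed)
print(unicodedata.normalize('NFD', composed) == decomposed)
```

The two strings differ in length (4 versus 5 code points) and do not compare equal until both are brought to the same normalization form.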
From https://docs.python.org/2/library/unicodedata.html:
Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
This algorithm first decomposes the string (using the 'NFD' form) so that all diacritics become combining characters, then filters out all combining characters, and finally recomposes the string (using the 'NFC' form).
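The three steps can be traced on the word from the question. This is just an illustrative walk-through with made-up variable names, not a replacement for the function above.

```python
import unicodedata

word = 'ko\u010d\u00edk'   # 'kočík' written with precomposed 'č' and 'í'

# Step 1: decompose, so 'č' becomes 'c' + U+030C and 'í' becomes 'i' + U+0301.
decomposed = unicodedata.normalize('NFD', word)

# Step 2: drop every combining character (nonzero combining class).
stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))

# Step 3: recompose whatever is left.
result = unicodedata.normalize('NFC', stripped)

print(len(word), len(decomposed), result)
```

Note that this only affects characters that Unicode decomposes into a base letter plus combining marks. Letters such as 'ł' or 'ø' have no such decomposition, so they pass through unchanged.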
Upvotes: 6