fabio

Reputation: 1365

Comparing strings while normalizing special characters in Python

My English could probably be better, but what I want is to ignore accents (and the like) in words, so that:

renè, rené, rene' and rene should be the same, and so should

mañana and manana, or

even-distribuited and even distribuited, and possibly

shouldn't and shouldnt

I remember a function (derived from journalism) that is used, for example, for web page addresses and that strips out spaces, accents, etc., but I don't remember its name. I think it should work, but other approaches are welcome.

Thank you

Edit:

The function I had in mind is slugify() from Django, but it is probably not enough.

Upvotes: 1

Views: 72

Answers (1)

Romain

Reputation: 21958

The standard approach to getting rid of special characters seems to be discussed in this question. But you could also consider another approach, often called fuzzy matching (or fuzzy search):

[...] technique of finding strings that match a pattern approximately (rather than exactly)

In Python you can use TheFuzz to do that. Here is an attempt based on your examples.

from thefuzz import fuzz

pairs = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]

# Unpack each pair instead of naming the loop variable `tuple`, which would shadow the builtin
for a, b in pairs:
    print(f"{a} vs {b}: {fuzz.ratio(a, b)}")

# mañana vs manana: 83
# shouldn't vs shouldnt: 94
# even-distribuited vs even distribuited: 94

So you could define a rule based on the ratio, such as a minimum score, to decide whether two strings match.
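As a rough sketch of such a rule: the snippet below scores two strings by their longest common subsequence and compares the score with a cutoff. The `similarity` helper is a pure-Python stand-in so the example runs without installing anything; with TheFuzz you would call `fuzz.ratio` in its place. The names `lcs_length`, `similarity`, `is_match` and the cutoff of 80 are all assumptions made for illustration, and the cutoff would need tuning on real data.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Score from 0 to 100: twice the shared length over the combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # two empty strings are treated as identical
    return 200.0 * lcs_length(a, b) / total


def is_match(a: str, b: str, threshold: float = 80.0) -> bool:
    """Hypothetical rule: a pair matches when its score reaches the threshold."""
    return similarity(a, b) >= threshold


print(is_match("mañana", "manana"))  # True (score is about 83)
print(is_match("rene", "mañana"))    # False (score is 20)
```

A lower threshold accepts more variants but also more false positives, so it is worth checking the rule against pairs that must not match.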


You could even combine Unicode normalization and fuzzy matching for better results.

import unicodedata

from thefuzz import fuzz

pairs = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]

def strip_accents(s):
    # Decompose accented characters, then drop everything that is not ASCII
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

def compare(pairs, normalize=True):
    for a, b in pairs:
        if normalize:
            a, b = strip_accents(a), strip_accents(b)
        print(f"{a} vs {b}: {fuzz.ratio(a, b)}")

compare(pairs)

# manana vs manana: 100
# shouldn't vs shouldnt: 94
# even-distribuited vs even distribuited: 94
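If an exact yes/no answer is enough for the examples in the question, a more aggressive normalization may do the job without any scoring. This is only a sketch using the standard library: `comparison_key` is a made-up name, and the choice to drop every character that is not an ASCII letter or digit is an assumption that will not suit every language (letters such as "ø" do not decompose and are removed entirely).

```python
import re
import unicodedata


def comparison_key(text: str) -> str:
    """Reduce a string to lowercase ASCII letters and digits for comparison."""
    # Split accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents, tildes, ...)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case, then remove spaces, hyphens, apostrophes and anything else
    return re.sub(r"[^a-z0-9]", "", stripped.casefold())


words = ["renè", "rené", "rene'", "rene"]
print({comparison_key(w) for w in words})  # {'rene'}
print(comparison_key("even-distribuited") == comparison_key("even distribuited"))  # True
```

The price is that distinct words can collapse into the same key (for example "re-sign" and "resign"), so this trades precision for simplicity.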

Upvotes: 3
