Juan Carlos

Reputation: 367

Pandas: Query using Levenshtein Distance

Given the following dataset:

name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;26
jenny;female;boston2;30
mattia;na;BostonDynamics;50

and the constraints:

source = "john"
max_dist = 2

My goal is to get a list of all name values whose Levenshtein distance from the source is <= max_dist. Is it possible to do this with the pandas.DataFrame.query() method, or does it have to be done a different way?

Upvotes: 3

Views: 2387

Answers (1)

gold_cy

Reputation: 14226

You would do it a different way.

import io

import editdistance  # third-party: first do pip install editdistance
import pandas as pd

# constraints from the question
source = "john"
max_dist = 2

s = io.StringIO("""name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;26
jenny;female;boston2;30
mattia;na;BostonDynamics;50""")

df = pd.read_csv(s, sep=';')

# boolean mask: True where the name is within max_dist edits of source
mask = df["name"].apply(lambda x: editdistance.eval(source, x) <= max_dist)

df[mask]

   name   sex     city  age
0  john  male  newyork   20


df.loc[mask, "name"].tolist()

['john']
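
If installing editdistance is not an option, the distance can also be computed in plain Python. The following is only a minimal sketch of the standard dynamic-programming Levenshtein algorithm. The helper name levenshtein and the names list are hypothetical and not part of the original answer, and it will likely be slower than the C-backed editdistance package on large inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions.

    Hypothetical helper for illustration; not taken from any library.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


source = "john"
max_dist = 2
# the name column from the question, written out as a plain list
names = ["john", "jack", "mary", "maryanne", "eric", "jenny", "mattia"]

matches = [name for name in names if levenshtein(source, name) <= max_dist]
print(matches)
```

With the DataFrame from the answer, the same function should slot into the filter in place of editdistance.eval, for example df["name"].apply(lambda x: levenshtein(source, x) <= max_dist).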

Upvotes: 3
