Reputation: 325
I'm trying to compare the values in 2 dataframes. This is my code :
for i in df1['Searches']:
    for j in df2['Tags']:
        if i == j:
            print(i, j)
The code works. However, I want to account for cases where the strings don't entirely match, due to spacing, misspelling, or punctuation, but they should match given how much they have in common.
For instance:
Searches | Tags
----------------------------------
lightblue | light blue
light-blue | light blue
light blu | light blue
lite blue | light blue
liteblue | light blue
liteblu | light blue
light b l u e | light blue
light.blue | light blue
l i ght blue | light blue
I listed variations of possible strings that could show up under searches, and the string that it should match to under tags. Is there a way to account for those variations and still have them match?
Thank you for taking the time to read my question and help in any way you can.
Upvotes: 2
Views: 112
Reputation: 333
You can use a string similarity metric to determine a match. For example, here I use edit_distance from the nltk library:
import pandas as pd
from nltk.metrics.distance import edit_distance

searches = [
    'lightblue',
    'light-blue',
    'light blu',
    'lite blue',
    'liteblue',
    'liteblu',
    'light b l u e',
    'light.blue',
    'l i ght blue',
    'totally different string',
]

df = pd.DataFrame()
df['Searches'] = searches
df['Tags'] = 'light blue'

matches = []
distance_threshold = 5
for i in df['Searches']:
    for j in df['Tags']:
        if edit_distance(i, j) < distance_threshold:
            # print(i, j)
            matches.append(i)

print(list(set(matches)))
Output:
['light.blue',
'light b l u e',
'lightblue',
'light blu',
'light-blue',
'lite blue',
'l i ght blue',
'liteblu',
'liteblue']
But you will have to adjust distance_threshold to your liking, or choose another metric that works better. See the list of metrics here: https://www.nltk.org/api/nltk.metrics.distance.html
There are many other libraries you can try as well; a quick search will turn them up.
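For instance, Python's standard-library difflib module offers SequenceMatcher, which returns a similarity ratio between 0 and 1 with no extra dependencies. A minimal sketch (the 0.7 cutoff and the is_fuzzy_match helper name are arbitrary choices for illustration, not part of the question):

```python
from difflib import SequenceMatcher

def is_fuzzy_match(a, b, cutoff=0.7):
    # ratio() returns a similarity score in [0, 1];
    # 1.0 means the two strings are identical.
    return SequenceMatcher(None, a, b).ratio() >= cutoff

print(is_fuzzy_match('lightblue', 'light blue'))               # True
print(is_fuzzy_match('totally different string', 'light blue'))  # False
```

As with edit_distance, you would tune the cutoff on your own data.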
Upvotes: 1
Reputation: 93151
You are getting into fuzzy string matching. One way to do that is to use a similarity metric such as jaro_similarity from the Natural Language Toolkit (NLTK):
from nltk.metrics.distance import jaro_similarity
df['jaro_similarity'] = df.apply(lambda row: jaro_similarity(row['Searches'], row['Tags']), axis=1)
Result:
Searches       Tags        jaro_similarity
lightblue      light blue  0.966667
light-blue     light blue  0.933333
light blu      light blue  0.966667
lite blue      light blue  0.896296
liteblue       light blue  0.858333
liteblu        light blue  0.819048
light b l u e  light blue  0.923077
light.blue     light blue  0.933333
l i ght blue   light blue  0.877778
You have to pick a cut-off point by experimenting on your data. Documentation on the nltk.metrics.distance module: https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance
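Once you have the similarity column, filtering is a one-liner with boolean indexing. A minimal sketch with hard-coded scores standing in for the computed column (the 0.8 cut-off and the example score for the non-matching row are arbitrary, for illustration only):

```python
import pandas as pd

# Stand-in scores, as if produced by the apply() call above.
df = pd.DataFrame({
    'Searches': ['lightblue', 'liteblu', 'totally different string'],
    'jaro_similarity': [0.966667, 0.819048, 0.45],
})

# Keep only the rows whose similarity clears the chosen cut-off.
matched = df[df['jaro_similarity'] >= 0.8]
print(matched['Searches'].tolist())  # ['lightblue', 'liteblu']
```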
Upvotes: 3