shorttriptomars

Reputation: 325

How do you match strings with different values in pandas?

I'm trying to compare the values in two dataframes. This is my code:

for i in df1['Searches']:
    for j in df['Tags']:
        if i == j:
            print(i, j)

The code works. However, I also want to catch cases where the strings don't match exactly because of spacing, misspelling, or punctuation, but are similar enough that they should still be treated as a match.

For instance:

   Searches       |   Tags 
----------------------------------
   lightblue      |   light blue
   light-blue     |   light blue
   light blu      |   light blue
   lite blue      |   light blue
   liteblue       |   light blue
   liteblu        |   light blue
   light b l u e  |   light blue
   light.blue     |   light blue
   l i ght blue   |   light blue
  

Above I've listed variations of strings that could show up under Searches, and the string each one should match under Tags. Is there a way to account for those variations and still have them match?

Thank you for taking the time to read my question and help in any way you can.

Upvotes: 2

Views: 112

Answers (2)

jollibobert

Reputation: 333

You can use a string similarity metric to determine a match. For example, here I use edit_distance from the nltk library:

import pandas as pd
from nltk.metrics.distance import edit_distance

searches = [
    'lightblue',
    'light-blue',
    'light blu',
    'lite blue',
    'liteblue',
    'liteblu',
    'light b l u e',
    'light.blue',
    'l i ght blue',
    'totally different string',
]
df = pd.DataFrame()
df['Searches'] = searches
df['Tags'] = 'light blue'

matches = []
distance_threshold = 5
for i in df['Searches']:
    for j in df['Tags']:
        if edit_distance(i, j) < distance_threshold:
            # print(i, j)
            matches.append(i)
print(list(set(matches)))

Output:

['light.blue',
 'light b l u e',
 'lightblue',
 'light blu',
 'light-blue',
 'lite blue',
 'l i ght blue',
 'liteblu',
 'liteblue']

But you will have to adjust distance_threshold to your liking or choose another metric that works better. See a list of metrics here:

https://www.nltk.org/api/nltk.metrics.distance.html

There are many other libraries that you can try as well. Just give it a search.
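For instance, here is a minimal sketch using Python's built-in difflib, so no extra dependency is needed. The toy dataframe and the 0.8 cut-off are my own assumptions; you would tune the cut-off the same way as distance_threshold above:

from difflib import SequenceMatcher

import pandas as pd

# Same idea as above: every search is compared against the tag 'light blue'.
df = pd.DataFrame({'Searches': ['lightblue', 'lite blu', 'totally different string']})
df['Tags'] = 'light blue'

# SequenceMatcher.ratio() returns a similarity in [0, 1]; 1.0 means identical.
df['ratio'] = df.apply(
    lambda row: SequenceMatcher(None, row['Searches'], row['Tags']).ratio(), axis=1
)

# 0.8 is an arbitrary threshold; tune it on your data.
print(df[df['ratio'] > 0.8])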

Upvotes: 1

Code Different

Reputation: 93151

You are getting into fuzzy string matching. One way to do that is to use a similarity metric such as jaro_similarity from the Natural Language Toolkit (NLTK):

from nltk.metrics.distance import jaro_similarity
df['jaro_similarity'] = df.apply(lambda row: jaro_similarity(row['Searches'], row['Tags']), axis=1)

Result:

     Searches       Tags  jaro_similarity
    lightblue light blue         0.966667
   light-blue light blue         0.933333
    light blu light blue         0.966667
    lite blue light blue         0.896296
     liteblue light blue         0.858333
      liteblu light blue         0.819048
light b l u e light blue         0.923077
   light.blue light blue         0.933333
 l i ght blue light blue         0.877778

You have to pick a cut-off point by experimenting on your data. Documentation on the nltk.metrics.distance module: https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance
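For example, a minimal sketch of applying such a cut-off, assuming the dataframe above and a hypothetical threshold of 0.85:

# 0.85 is a hypothetical cut-off; experiment on your own data.
threshold = 0.85
matched = df[df['jaro_similarity'] >= threshold]
print(matched[['Searches', 'Tags']])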

Upvotes: 3
