Reputation: 325
I'm trying to compare the values in 2 dataframes. This is my code :
for i in df1['Searches']:
    for j in df2['Tags']:
        if i == j:
            print(i, j)
The code works. However, I want to account for cases where the strings don't entirely match, due to spacing, misspelling, or punctuation, but they should match given how much they have in common.
For instance:
Searches | Tags
----------------------------------
lightblue | light blue
light-blue | light blue
light blu | light blue
lite blue | light blue
liteblue | light blue
liteblu | light blue
light b l u e | light blue
light.blue | light blue
l i ght blue | light blue
I listed variations of possible strings that could show up under searches, and the string that it should match to under tags. Is there a way to account for those variations and still have them match?
Thank you for taking the time to read my question and help in any way you can.
Upvotes: 2
Views: 112
Reputation: 333
You can use a string similarity metric to determine a match. For example, here I use edit_distance from the nltk library:
import pandas as pd
from nltk.metrics.distance import edit_distance

searches = [
    'lightblue',
    'light-blue',
    'light blu',
    'lite blue',
    'liteblue',
    'liteblu',
    'light b l u e',
    'light.blue',
    'l i ght blue',
    'totally different string',
]

df = pd.DataFrame()
df['Searches'] = searches
df['Tags'] = 'light blue'

matches = []
distance_threshold = 5
for i in df['Searches']:
    for j in df['Tags']:
        if edit_distance(i, j) < distance_threshold:
            # print(i, j)
            matches.append(i)

print(list(set(matches)))
Output:
['light.blue',
'light b l u e',
'lightblue',
'light blu',
'light-blue',
'lite blue',
'l i ght blue',
'liteblu',
'liteblue']
But you will have to adjust distance_threshold to your liking, or choose another metric that works better. See the list of metrics here: https://www.nltk.org/api/nltk.metrics.distance.html
There are many other libraries you can try as well; a quick search will turn them up.
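For instance, Python's standard-library difflib module offers SequenceMatcher, which returns a similarity ratio between 0 and 1 with no extra dependencies. A minimal sketch (the 0.7 cutoff and the is_fuzzy_match helper name are arbitrary choices for illustration, not part of the question):

```python
from difflib import SequenceMatcher

def is_fuzzy_match(a, b, cutoff=0.7):
    # ratio() returns a similarity score in [0, 1];
    # 1.0 means the two strings are identical.
    return SequenceMatcher(None, a, b).ratio() >= cutoff

print(is_fuzzy_match('lightblue', 'light blue'))               # True
print(is_fuzzy_match('totally different string', 'light blue'))  # False
```

As with edit_distance, you would tune the cutoff on your own data.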
Upvotes: 1
Reputation: 93151
You are getting into fuzzy string matching. One way to do that is to use a similarity metric such as jaro_similarity from the Natural Language Toolkit (NLTK):
from nltk.metrics.distance import jaro_similarity
df['jaro_similarity'] = df.apply(lambda row: jaro_similarity(row['Searches'], row['Tags']), axis=1)
Result:
Searches       Tags        jaro_similarity
lightblue      light blue  0.966667
light-blue     light blue  0.933333
light blu      light blue  0.966667
lite blue      light blue  0.896296
liteblue       light blue  0.858333
liteblu        light blue  0.819048
light b l u e  light blue  0.923077
light.blue     light blue  0.933333
l i ght blue   light blue  0.877778
You have to pick a cut-off point by experimenting on your data. Documentation on the nltk.metrics.distance module: https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance
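Once you have the similarity column, filtering is a one-liner with boolean indexing. A minimal sketch with hard-coded scores standing in for the computed column (the 0.8 cut-off and the example score for the non-matching row are arbitrary, for illustration only):

```python
import pandas as pd

# Stand-in scores, as if produced by the apply() call above.
df = pd.DataFrame({
    'Searches': ['lightblue', 'liteblu', 'totally different string'],
    'jaro_similarity': [0.966667, 0.819048, 0.45],
})

# Keep only the rows whose similarity clears the chosen cut-off.
matched = df[df['jaro_similarity'] >= 0.8]
print(matched['Searches'].tolist())  # ['lightblue', 'liteblu']
```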
Upvotes: 3