Reputation: 37
I am learning text processing and am stuck. I have a dataset from a survey about which website a user spends their money on while shopping.
I have data of the form: amazon, amzn, amazon prime, amazon.com, amzn prim, etc.
Now I want to create a dictionary which clubs the similar values under one key, like
dict1 = {"AMAZON":["amazon","amzn","amazon prime","amazon.com", "amzn prim"],
"Coursera" : ["coursera","corsera","coursera.org","coursera.com"]}
The main goal of the dictionary is to create another column in the dataframe holding the key for each website name.
I have tried fuzzywuzzy but am unable to figure out how to club the similar values under one key.
Thanks :)
Upvotes: 1
Views: 285
Reputation: 1506
Your task is to associate a response str with the correct key str from a list of pre-defined keys. Therefore, you need to compare a given str response (e.g. "amazon.com") with each of your keys ["AMAZON", "Coursera"] and pick the key that displays the highest similarity with respect to some metric.
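As a minimal sketch of that comparison, using the fuzzywuzzy package the question already mentions (the key list is an assumption):

from fuzzywuzzy import process

keys = ["AMAZON", "Coursera"]
# extractOne returns the best-matching key together with a similarity score (0-100)
best_key, score = process.extractOne("amazon.com", keys)
print(best_key)  # AMAZON (with the default scorer)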
1. Manual choice of Keys
Choosing a suitable metric on strings is the tricky part, as such metrics merely treat them as arrays of characters. No consideration is given to the semantics of the words and no domain knowledge is involved. In turn, I'd suggest a manual matching if the number of keys is low. Python's built-in string class str provides lower() to make the comparison case-invariant, and the in-operator checks for membership of a substring. This is a good starting point.
def getKey(website: str):
    # case-insensitive
    website = website.lower()
    # 1. handcrafted key-pattern matching
    refDict = dict()
    refDict['AMAZON'] = ["amzn", "amazon"]
    refDict['COURSERA'] = ["coursera", "corsera"]
    for k, v in refDict.items():
        if any(pattern in website for pattern in v):
            return k
    # if no match was found
    return ""
For a Pandas DataFrame, this yields
import pandas as pd

df = pd.DataFrame({'website': ['amazon', 'amzn.com', 'coursera', 'corsera', 'cosera', 'save-the-amazon-forest.org']})
df['key'] = [getKey(website) for website in df['website']]
df
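Assuming the patterns above, the output should look something like:

                      website       key
0                      amazon    AMAZON
1                    amzn.com    AMAZON
2                    coursera  COURSERA
3                     corsera  COURSERA
4                      cosera
5  save-the-amazon-forest.org    AMAZON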
As you can see, this string comparison is inherently brittle, too: "cosera" goes unmatched, while "save-the-amazon-forest.org" is wrongly assigned to AMAZON. In addition, the order of the keys in the dictionary matters. Note that dictionaries maintain insertion order by default only since Python 3.7 (as an implementation detail already in CPython 3.6). If you use an earlier version, use an OrderedDict to keep control of the order.
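A minimal sketch, assuming the same patterns as above:

from collections import OrderedDict

# check AMAZON patterns before COURSERA patterns, on any Python version
refDict = OrderedDict([('AMAZON', ["amzn", "amazon"]),
                       ('COURSERA', ["coursera", "corsera"])])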
If you can force users to enter a proper URL, you might want to consider extracting the domain from the string via a regular expression and using it directly as the key. This would save you from listing keys and matching patterns manually in getKey() altogether. This approach is presented here.
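A minimal sketch of that idea; the pattern below is an assumption, not a full URL grammar:

import re

def extractDomain(website: str):
    # strip an optional scheme/www prefix, keep the second-level domain
    match = re.search(r'(?:https?://)?(?:www\.)?([a-z0-9-]+)\.[a-z.]+', website.lower())
    return match.group(1).upper() if match else ""

extractDomain('https://www.amazon.com')  # 'AMAZON'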
2. Automatic keys via unsupervised learning
Since the additional requirement was raised that the algorithm needs to find the keys in an unsupervised fashion, the following code invokes the edit (Levenshtein) distance and clustering to do exactly that.
import pandas as pd
import numpy as np
from sklearn.cluster import AffinityPropagation
from nltk.metrics import edit_distance
# example input
websiteList = ["amazon", "apple.com", "amzn", "amazon prime" , "amazon.com", "cosera", "apple inc.", "amzn prim", "coursera",
"coursera", "coursera.org", "coursera.com", "StackOverFlow.com", "stackoverflow", "stack-overflow.com",
"corsing", "apple", "AAPL"]
websiteListRaw = list(websiteList) # copy for later
df = pd.DataFrame({'website' : websiteList})
def minEditDistance(s1, s2):
    '''Minimum edit distance across all pairwise input (sub-)strings'''
    ptrList_1 = s1.split(' ') + [s1]
    ptrList_2 = s2.split(' ') + [s2]
    return min(edit_distance(x_i, x_j) for x_i in ptrList_1 for x_j in ptrList_2)
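# e.g. minEditDistance("amazon prime", "amzn prim") == 1, via "prime" vs. "prim"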
# lowercase
websiteList = [site.lower() for site in websiteList]
N = len(websiteList)

# delete common suffixes (str.removesuffix requires Python 3.9+)
suffixList = ['.com', '.org', 'co.uk', '.eu']
for i in range(N):
    for suffix in suffixList:
        websiteList[i] = websiteList[i].removesuffix(suffix)

# replace special characters with whitespace
specialSymbolList = ['/', '-', '*']
for i in range(N):
    for symbol in specialSymbolList:
        websiteList[i] = websiteList[i].replace(symbol, ' ')
# similarity = -1 * distance
responses = np.array(websiteList)
minEditSimilarity = (-1.0)*np.array([[minEditDistance(w1,w2) for w1 in responses] for w2 in responses])
# clustering
affprop = AffinityPropagation(affinity="precomputed", damping=0.54, random_state=77)
affprop.fit(minEditSimilarity)
# map every raw response to its cluster exemplar
matchDict = dict()
for cluster_id in np.unique(affprop.labels_):
    exemplar = responses[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(responses[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
    # assign the (uppercased) exemplar as key to every matching raw response
    for resp in cluster:
        match_indices = [i for i, name in enumerate(websiteList) if name == resp]
        for resp_index in match_indices:
            matchDict[websiteListRaw[resp_index]] = exemplar.split(' ')[0].upper()

# add learned keys
df['key'] = df['website'].replace(matchDict)
df
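If you still want the dictionary in the shape the question asked for, you can invert matchDict afterwards; a minimal sketch (the name dict1 is taken from the question):

# invert matchDict: key -> list of raw survey responses
dict1 = dict()
for raw, key in matchDict.items():
    dict1.setdefault(key, []).append(raw)
print(dict1)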
Upvotes: 1