Reputation: 792
I have a dataset with two columns: customer id and addresses:
id addresses
1111 asturias 32, benito juarez, CDMX
1111 JOSE MARIA VELASCO, CDMX
1111 asturias 32 DEPT 401, INSURGENTES, CDMX
1111 deportes
1111 asturias 32, benito juarez, MIXCOAC, CDMX
1111 cd. de los deportes
1111 deportes, wisconsin
2222 TORRE REFORMA LATINO, CDMX
2222 PERISUR 2890
2222 WE WORK, CDMX
2222 WEWORK, TORRE REFORMA LATINO, CDMX
2222 PERISUR: 2690, COYOCAN
2222 TORRE REFORMA LATINO
I am interested in finding the number of different addresses for each customer. For example, for customer id 1111, there are 3 different addresses:
[asturias 32, benito juarez, CDMX,
asturias 32 DEPT 401, INSURGENTES, CDMX,
asturias 32, benito juarez, MIXCOAC, CDMX]
[JOSE MARIA VELASCO, CDMX]
[deportes,
cd. de los deportes,
deportes, wisconsin]
I wrote code in Python which can only show the similarity between two consecutive rows, row i and row i+1 (a score of 0 means completely dissimilar and 1 means completely similar).
id addresses score
1111 asturias 32, benito juarez, CDMX 0
1111 JOSE MARIA VELASCO, CDMX 0
1111 asturias 32 DEPT 401, INSURGENTES, CDMX 0
1111 deportes 0
1111 asturias 32, benito juarez, MIXCOAC, CDMX 0
1111 cd. de los deportes 0.21
1111 deportes, wisconsin 0
2222 TORRE REFORMA LATINO, CDMX 0
2222 PERISUR 2890 0
2222 WE WORK, CDMX 0.69
2222 WEWORK, TORRE REFORMA LATINO, CDMX 0
2222 PERISUR: 2690, COYOCAN 0
2222 TORRE REFORMA LATINO
If the score is > 0.20, I consider them to be the same address. Following is my code:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata
import unidecode
import string
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.read_csv('address.csv')

nltk.download('punkt')
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(
        nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]

# compare each row only with the following one
for i in range(len(data) - 1):
    print(cosine_sim(data['addresses'][i], data['addresses'][i + 1]))
But the above code is not able to compare every possible pair of rows for a particular customer id. I want output like below:
id unique address
1111 3
2222 3
3333 2
Upvotes: 1
Views: 133
Reputation: 401
You can use combinations from itertools for this purpose. Please see the complete code below. Please note that I used a semicolon-separated CSV file. Also, you can use the similarity function in spaCy to find the similarity between two phrases if you want (a minimal sketch is shown just below); here I used the same function you provided.
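If you do want to try spaCy, an illustrative drop-in replacement for cosine_sim could look like the following (this is only a sketch, not part of the original code, and assumes a model with word vectors such as en_core_web_md is installed):

import spacy

nlp = spacy.load("en_core_web_md")  # a model that ships with word vectors

def cosine_sim(text1, text2):
    # spaCy's vector-based similarity between the two phrases; higher means more similar
    return nlp(text1).similarity(nlp(text2))

The complete code, using the same cosine_sim you provided, is below.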
import nltk
import numpy as np
import pandas as pd
import itertools
import string
from sklearn.feature_extraction.text import TfidfVectorizer

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(
        nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]

def group_addresses(addresses):
    '''merge the lists if they have an element in common'''
    out = []
    while len(addresses) > 0:
        # first, *rest = addresses  # for Python 3
        first, rest = addresses[0], addresses[1:]  # for Python 2
        first = set(first)
        lf = -1
        while len(first) > lf:  # keep absorbing lists until the group stops growing
            lf = len(first)
            rest2 = []
            for r in rest:
                if len(first.intersection(set(r))) > 0:
                    first |= set(r)
                else:
                    rest2.append(r)
            rest = rest2
        out.append(first)
        addresses = rest
    return out

df = pd.read_csv("address.csv", sep=";")

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

sim_df = pd.DataFrame(columns=['id', 'unique address'])

for customer in set(df['id']):
    customer_addresses = df.loc[df['id'] == customer]['addresses']  # get the addresses of this customer
    all_entries = [[adr] for adr in customer_addresses]  # make a list of single-address lists
    # find all pairs whose similarity is greater than 0.2
    sim_pairs = [list((text1, text2))
                 for text1, text2 in itertools.combinations(customer_addresses, 2)
                 if cosine_sim(text1, text2) > 0.2]
    all_entries.extend(sim_pairs)
    sim_pairs = group_addresses(all_entries)  # merge overlapping lists into address groups
    print(customer, len(sim_pairs))
The output looks like:
2222 2
1111 3
The groups formed are:
2222
['WE WORK, CDMX', 'WEWORK, TORRE REFORMA LATINO, CDMX', 'TORRE REFORMA LATINO, CDMX', 'TORRE REFORMA LATINO']
['PERISUR 2890', 'PERISUR: 2690, COYOCAN']
1111
['asturias 32 DEPT 401, INSURGENTES, CDMX', 'asturias 32, benito juarez, MIXCOAC, CDMX', 'asturias 32, benito juarez, CDMX']
['JOSE MARIA VELASCO, CDMX']
['deportes, wisconsin', 'cd. de los deportes', 'deportes']
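Note that the sim_df DataFrame created above is never actually filled. If you want the result in the id / unique address format from your question, one possible way (a sketch reusing df, cosine_sim and group_addresses from the code above) is to collect the counts in a list and build the DataFrame at the end:

results = []
for customer in set(df['id']):
    customer_addresses = df.loc[df['id'] == customer]['addresses']
    all_entries = [[adr] for adr in customer_addresses]
    # add every pair of addresses whose similarity exceeds the 0.2 threshold
    all_entries.extend([text1, text2]
                       for text1, text2 in itertools.combinations(customer_addresses, 2)
                       if cosine_sim(text1, text2) > 0.2)
    # the number of merged groups is the number of unique addresses
    results.append({'id': customer, 'unique address': len(group_addresses(all_entries))})

sim_df = pd.DataFrame(results, columns=['id', 'unique address'])
print(sim_df)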
Upvotes: 1