Reputation: 1
Task: I am new to python and currently working on a clustering task where I compute the similarity between users clickstreams. Therefore I am using the Jaccard Index to compare the click sets (clickstream) of each two users, saving the results in an NxN distance matrix and then performing Wards clustering algorithm on this distance matrix.
Problem: Having tried out everything with data from 1 day (around 85 Session IDs/users), it worked like a charm. Now with 949 unique users, the computation takes ages, probably due to my inefficient code.
Here is a snapshot from my df stack_dataframe: 9329 rows x 2 columns
Here is my code for computing the distance matrix:
import itertools
import pandas as pd
# Method to compute Jaccard similarity index between two sets
def compute_jaccard(session1_vals, session2_vals):
intersection = session1_vals.intersection(session2_vals)
union = session1_vals.union(session2_vals)
jaccard = len(intersection)/float(len(union))
return jaccard
stID_keys = stack_dataframe.groupby(['Session ID']).groups.keys()
print("hallo1")
New_stack_df = stack_dataframe.pivot(columns="Session ID", values="Page")
print("hallo2")
sim_df = pd.DataFrame(columns=ID_keys, index=ID_keys)
# Iterate through columns and compute metric
test = 0
print("hallo3")
for col_pair in itertools.combinations(New_stack_df.columns, 2):
print(test)
test += 1
u1= col_pair[0]
u2 = col_pair[1]
sim_df.loc[col_pair] = compute_jaccard(set(New_stack_df[u1].dropna()),
set(New_stack_df[u2].dropna()))
print(sim_df)
Any help much appreciated, thanks!
Upvotes: 0
Views: 1445
Reputation: 2910
Your method is highly inefficient. The inefficiency is mainly due to two reasons :
itertools.combinations(..)
We solve these by
So the code is:
from __future__ import division
import time
import pandas as pd
import numpy as np
from scipy.spatial import distance
import numba as nb
def _make_2D_array(lis):
n = len(lis)
lengths = np.array([len(x) for x in lis])
max_len = max(lengths)
arr = np.zeros((n, max_len))
for i in range(n):
arr[i, :lengths[i]] = lis[i]
return arr, lengths
@nb.jit(nopython=True, cache=True)
def compute_jaccard(session1, session2, arr, lengths):
"""Jited funciton to calculate jaccard distance
"""
session1, session2 = session1[0], session2[0]
intersection, union = 0, 0
if(lengths[session2] > lengths[session1]):
session1, session2 = session2, session1
marked = np.zeros((lengths[session2],))
for x in arr[session1][:lengths[session1]]:
x_in_2 = arr[session2][:lengths[session2]] == x
marked[x_in_2] = 1
if(np.any(x_in_2)):
intersection+=1
union+=1
else:
union+=1
union+=np.sum(marked==0)
jaccard = intersection/union
return jaccard
def calculate_sim_between(stack_dataframe):
# get integer encodings for session ids and pages
session_encode, sessions = pd.factorize(stack_dataframe["Session ID"])
page_encode, pages = pd.factorize(stack_dataframe["Page"])
# take unique pages in each session
pages_in_sessions = [np.unique(page_encode[session_encode==x]) for x in range(len(sessions))]
# convert the list of lists to numpy array
arr, lengths = _make_2D_array(pages_in_sessions)
# make a dummy array like [[0], [1], [2] ...] to get the distances between every pair of sessions
_sessions = np.arange(len(sessions))[:, np.newaxis]
# get the distances
distances = distance.cdist(_sessions, _sessions, compute_jaccard, arr=arr, lengths=lengths)
sim_df = pd.DataFrame(distances, columns=sessions, index=sessions)
return sim_df
Notice that, we numba compile compute_jaccard function to eke out event the single level loop time. If you don't want to install numba, just comment out the decorator.
On this sample data:
from faker import Faker
fake = Faker()
stack_dataframe = pd.DataFrame({"Session ID":[fake.name() for i in range(200)], "Page":[fake.name() for i in range(200)]})
timings are
Your method : 69.465s
Without jit : 0.374s
With jit(on second run, to discount for the compile time) : 0.147s
P.S : Since we use fake data to run on a largish sample to observe the speedup, on your actual data, the timing profile may be slightly different.
Upvotes: 2