Kev.Pr
Kev.Pr

Reputation: 1

How to speed up Distance Matrix calculation?

Task: I am new to python and currently working on a clustering task where I compute the similarity between users clickstreams. Therefore I am using the Jaccard Index to compare the click sets (clickstream) of each two users, saving the results in an NxN distance matrix and then performing Wards clustering algorithm on this distance matrix.

Problem: Having tried out everything with data from 1 day (around 85 Session IDs/users), it worked like a charm. Now with 949 unique users, the computation takes ages, probably due to my inefficient code.

Here is a snapshot from my df stack_dataframe: 9329 rows x 2 columns

Here is my code for computing the distance matrix:

import itertools
import pandas as pd

# Method to compute Jaccard similarity index between two sets
def compute_jaccard(session1_vals, session2_vals):
    intersection = session1_vals.intersection(session2_vals)
    union = session1_vals.union(session2_vals)
    jaccard = len(intersection)/float(len(union))
    return jaccard


stID_keys = stack_dataframe.groupby(['Session ID']).groups.keys()
print("hallo1")
New_stack_df = stack_dataframe.pivot(columns="Session ID", values="Page")
print("hallo2")
sim_df = pd.DataFrame(columns=ID_keys, index=ID_keys)

# Iterate through columns and compute metric
test = 0
print("hallo3")
for col_pair in itertools.combinations(New_stack_df.columns, 2):
   print(test)
   test += 1
   u1= col_pair[0]
   u2 = col_pair[1]
   sim_df.loc[col_pair] = compute_jaccard(set(New_stack_df[u1].dropna()), 
   set(New_stack_df[u2].dropna()))


print(sim_df)

Any help much appreciated, thanks!

Upvotes: 0

Views: 1445

Answers (1)

Deepak Saini
Deepak Saini

Reputation: 2910

Your method is highly inefficient. The inefficiency is mainly due to two reasons :

  • The O(n^2) loop in itertools.combinations(..)
  • Heavy pandas usage. Though easy to use, pandas is slightly inefficient due to lot of book keeping.

We solve these by

  • Using scipy distance.cdist(source written in c) to calculate the distances between all pairs.
  • Use numpy instead of pandas
  • Jit compile the jaccard distance function as it is being called large number of times.

So the code is:

from __future__ import division
import time
import pandas as pd
import numpy as np
from scipy.spatial import distance
import numba as nb

def _make_2D_array(lis):
    n = len(lis)
    lengths = np.array([len(x) for x in lis])
    max_len = max(lengths)
    arr = np.zeros((n, max_len))
    for i in range(n):
        arr[i, :lengths[i]] = lis[i]
    return arr, lengths

@nb.jit(nopython=True, cache=True)
def compute_jaccard(session1, session2, arr, lengths):
    """Jited funciton to calculate jaccard distance
    """
    session1, session2 = session1[0], session2[0]
    intersection, union = 0, 0

    if(lengths[session2] > lengths[session1]):
        session1, session2 = session2, session1

    marked = np.zeros((lengths[session2],))

    for x in arr[session1][:lengths[session1]]:
        x_in_2 = arr[session2][:lengths[session2]] == x
        marked[x_in_2] = 1
        if(np.any(x_in_2)):
            intersection+=1
            union+=1
        else:
            union+=1

    union+=np.sum(marked==0)

    jaccard = intersection/union

    return jaccard

def calculate_sim_between(stack_dataframe):
    # get integer encodings for session ids and pages
    session_encode, sessions = pd.factorize(stack_dataframe["Session ID"])
    page_encode, pages = pd.factorize(stack_dataframe["Page"])

    # take unique pages in each session 
    pages_in_sessions = [np.unique(page_encode[session_encode==x]) for x in range(len(sessions))]

    # convert the list of lists to numpy array
    arr, lengths = _make_2D_array(pages_in_sessions)

    # make a dummy array like [[0], [1], [2] ...] to get the distances between every pair of sessions
    _sessions = np.arange(len(sessions))[:, np.newaxis]

    # get the distances
    distances = distance.cdist(_sessions, _sessions, compute_jaccard, arr=arr, lengths=lengths)

    sim_df = pd.DataFrame(distances, columns=sessions, index=sessions)
    return sim_df

Notice that, we numba compile compute_jaccard function to eke out event the single level loop time. If you don't want to install numba, just comment out the decorator.

Timings:

On this sample data:

from faker import Faker
fake = Faker()
stack_dataframe = pd.DataFrame({"Session ID":[fake.name() for i in range(200)], "Page":[fake.name() for i in range(200)]})

timings are

Your method : 69.465s

Without jit : 0.374s

With jit(on second run, to discount for the compile time) : 0.147s

P.S : Since we use fake data to run on a largish sample to observe the speedup, on your actual data, the timing profile may be slightly different.

Upvotes: 2

Related Questions