kurious

Reputation: 1044

Creating a matrix from Pandas dataframe to display connectedness

I have my data in this format in a pandas dataframe:

Customer_ID  Location_ID
Alpha             A
Alpha             B
Alpha             C
Beta              A
Beta              B
Beta              D

I want to study the mobility patterns of the customers. My goal is to determine the clusters of locations that are most frequented by customers. I think the following matrix can provide such information:

   A  B  C  D
A  0  2  1  1
B  2  0  1  1
C  1  1  0  0
D  1  1  0  0

How do I do so in Python?

My dataset is quite large (hundreds of thousands of customers and about a hundred locations).

Upvotes: 0

Views: 1237

Answers (2)

mark s.

Reputation: 656

I'm sure there's a more elegant way, but here's a solution I came up with on the fly. Basically you build an adjacency list for each customer, then update the adjacency matrix accordingly:

import pandas as pd

# I'm assuming you can get your data into a pandas data frame:
data = {'Customer_ID': [1, 1, 1, 2, 2], 'Location': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)

# Initialize an empty matrix
matrix_size = df['Location'].nunique()
matrix = [[0 for col in range(matrix_size)] for row in range(matrix_size)]

# To make life easier, I made a map to go from locations
# to row/col positions in the matrix
location_set = sorted(df['Location'].unique())
location_map = dict(zip(location_set, range(len(location_set))))

# Group data by customer, create an adjacency list (dyct) for each,
# and update the matrix accordingly
for name, group in df.groupby('Customer_ID'):
    locations = set(group['Location'])
    dyct = {}
    for i in locations:
        # difference needs a set here; passing the bare string would
        # strip individual characters from multi-character location names
        dyct[i] = list(locations.difference({i}))

    # Loop through the adjacency list and update matrix
    for node, edges in dyct.items():
        for edge in edges:
            matrix[location_map[edge]][location_map[node]] += 1
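Given the scale mentioned in the question (hundreds of thousands of customers, about a hundred locations), a vectorized alternative may be worth considering: build a customer-by-location incidence matrix with `pd.crosstab` and multiply it by its transpose. This is a sketch, not part of the original answer; the `clip(upper=1)` is an assumption that repeat visits should not inflate the counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# customer x location incidence matrix (1 if the customer visited the location)
incidence = pd.crosstab(df['Customer_ID'], df['Location_ID']).clip(upper=1)

# location x location co-visit counts: entry (i, j) is the number of
# customers who visited both i and j
co = incidence.T.dot(incidence)
np.fill_diagonal(co.values, 0)  # zero the diagonal to match the desired output
print(co)
```

On the question's sample data this reproduces the 4x4 matrix shown in the question (A-B = 2, the other off-diagonal pairs 1 or 0).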

Upvotes: 1

hilberts_drinking_problem

Reputation: 11602

Here is one approach that takes into account the multiplicity of visits (e.g. if Customer X visits both LocA and LocB twice, they will contribute 2 to the corresponding position in the final matrix).

Idea:

  1. For each location, count visits by customer.
  2. For each location pair, find the sum of minimal numbers of visits for each customer who visited both.
  3. Use unstack and cleanup.

Counter plays nicely here because counters support many natural arithmetic operations, such as addition, intersection (per-key minimum), and union (per-key maximum).
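For example, Counter intersection (`&`) keeps the per-key minimum, which is the same quantity as `min(ctr1[k], ctr2[k])` in the code below:

```python
from collections import Counter

a = Counter({'Alpha': 2, 'Beta': 1})
b = Counter({'Alpha': 1})

# & keeps, for each key, the minimum of the two counts
# (keys whose minimum is zero are dropped entirely)
print(a & b)                   # Counter({'Alpha': 1})
print(sum((a & b).values()))   # 1
```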

import pandas as pd
from collections import Counter
from itertools import product

df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B'],
    })


ctrs = {location: Counter(gp.Customer_ID) for location, gp in df.groupby('Location_ID')}


# In [7]: ctrs
# Out[7]:
# {'A': Counter({'Alpha': 1, 'Beta': 1}),
#  'B': Counter({'Alpha': 1, 'Beta': 1}),
#  'C': Counter({'Alpha': 1})}


ctrs = list(ctrs.items())
overlaps = [(loc1, loc2, sum(min(ctr1[k], ctr2[k]) for k in ctr1))
    for i, (loc1, ctr1) in enumerate(ctrs, start=1)
    for (loc2, ctr2) in ctrs[i:] if loc1 != loc2]
overlaps += [(l2, l1, c) for l1, l2, c in overlaps]


df2 = pd.DataFrame(overlaps, columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2'])
df2 = df2.unstack().fillna(0).astype(int)


#      Count
# Loc2     A  B  C
# Loc1
# A        0  2  1
# B        2  0  1
# C        1  1  0
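If the leftover `Count` level in the column index is unwanted, one way to flatten it is `droplevel` (a sketch, rebuilding the same `df2` from the overlaps shown above):

```python
import pandas as pd

df2 = pd.DataFrame(
    [('A', 'B', 2), ('A', 'C', 1), ('B', 'C', 1),
     ('B', 'A', 2), ('C', 'A', 1), ('C', 'B', 1)],
    columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2']).unstack().fillna(0).astype(int)

# unstack leaves a two-level column index ('Count', location);
# drop the outer level so the columns are just the location labels
df2.columns = df2.columns.droplevel(0)
print(df2)
```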

If you'd like to disregard multiplicities, replace Counter(gp.Customer_ID) with Counter(set(gp.Customer_ID)).

Upvotes: 1
