kurious

Reputation: 1044

Creating a matrix from Pandas dataframe to display connectedness

I have my data in this format in a pandas dataframe:

Customer_ID  Location_ID
Alpha             A
Alpha             B
Alpha             C
Beta              A
Beta              B
Beta              D

I want to study the mobility patterns of the customers. My goal is to determine the clusters of locations that are most frequented by customers. I think the following matrix can provide such information:

   A  B  C  D
A  0  2  1  1
B  2  0  1  1
C  1  1  0  0
D  1  1  0  0

How do I do so in Python?

My dataset is quite large (hundreds of thousands of customers and about a hundred locations).

Upvotes: 0

Views: 1237

Answers (2)

mark s.

Reputation: 656

I'm sure there's a more elegant way, but here's a solution I came up with on the fly. Basically you build an adjacency list for each customer, then update the adjacency matrix accordingly:

import pandas as pd

# I'm assuming you can get your data into a pandas data frame:
data = {'Customer_ID': [1, 1, 1, 2, 2], 'Location': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)

# Initialize an empty matrix
matrix_size = df['Location'].nunique()
matrix = [[0 for col in range(matrix_size)] for row in range(matrix_size)]

# To make life easier, I made a map to go from locations
# to row/col positions in the matrix
location_set = sorted(df['Location'].unique())
location_map = dict(zip(location_set, range(len(location_set))))

# Group data by customer, create an adjacency list (dyct) for each,
# and update the matrix accordingly
for name, group in df.groupby('Customer_ID'):
    locations = set(group['Location'])
    dyct = {}
    for i in locations:
        # difference needs a set here; passing the bare string would
        # strip individual characters from multi-character location names
        dyct[i] = list(locations.difference({i}))

    # Loop through the adjacency list and update matrix
    for node, edges in dyct.items():
        for edge in edges:
            matrix[location_map[edge]][location_map[node]] += 1
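Given the scale mentioned in the question (hundreds of thousands of customers, about a hundred locations), a vectorized alternative may be worth considering: build a customer-by-location incidence matrix with `pd.crosstab` and multiply it by its transpose. This is a sketch, not part of the original answer; the `clip(upper=1)` is an assumption that repeat visits should not inflate the counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# customer x location incidence matrix (1 if the customer visited the location)
incidence = pd.crosstab(df['Customer_ID'], df['Location_ID']).clip(upper=1)

# location x location co-visit counts: entry (i, j) is the number of
# customers who visited both i and j
co = incidence.T.dot(incidence)
np.fill_diagonal(co.values, 0)  # zero the diagonal to match the desired output
print(co)
```

On the question's sample data this reproduces the 4x4 matrix shown in the question (A-B = 2, the other off-diagonal pairs 1 or 0).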

Upvotes: 1

hilberts_drinking_problem

Reputation: 11602

Here is one approach that takes into account the multiplicity of visits (e.g. if Customer X visits both LocA and LocB twice, they will contribute 2 to the corresponding position in the final matrix).

Idea:

  1. For each location, count visits by customer.
  2. For each location pair, find the sum of minimal numbers of visits for each customer who visited both.
  3. Use unstack and cleanup.

Counter plays nicely here because counters support many natural arithmetic operations, such as addition, intersection (per-key minimum), and union (per-key maximum).
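For example, Counter intersection (`&`) keeps the per-key minimum, which is the same quantity as `min(ctr1[k], ctr2[k])` in the code below:

```python
from collections import Counter

a = Counter({'Alpha': 2, 'Beta': 1})
b = Counter({'Alpha': 1})

# & keeps, for each key, the minimum of the two counts
# (keys whose minimum is zero are dropped entirely)
print(a & b)                   # Counter({'Alpha': 1})
print(sum((a & b).values()))   # 1
```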

import pandas as pd
from collections import Counter
from itertools import product

df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B'],
    })


ctrs = {location: Counter(gp.Customer_ID) for location, gp in df.groupby('Location_ID')}


# In [7]: ctrs
# Out[7]:
# {'A': Counter({'Alpha': 1, 'Beta': 1}),
#  'B': Counter({'Alpha': 1, 'Beta': 1}),
#  'C': Counter({'Alpha': 1})}


ctrs = list(ctrs.items())
overlaps = [(loc1, loc2, sum(min(ctr1[k], ctr2[k]) for k in ctr1))
    for i, (loc1, ctr1) in enumerate(ctrs, start=1)
    for (loc2, ctr2) in ctrs[i:] if loc1 != loc2]
overlaps += [(l2, l1, c) for l1, l2, c in overlaps]


df2 = pd.DataFrame(overlaps, columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2'])
df2 = df2.unstack().fillna(0).astype(int)


#      Count
# Loc2     A  B  C
# Loc1
# A        0  2  1
# B        2  0  1
# C        1  1  0
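If the leftover `Count` level in the column index is unwanted, one way to flatten it is `droplevel` (a sketch, rebuilding the same `df2` from the overlaps shown above):

```python
import pandas as pd

df2 = pd.DataFrame(
    [('A', 'B', 2), ('A', 'C', 1), ('B', 'C', 1),
     ('B', 'A', 2), ('C', 'A', 1), ('C', 'B', 1)],
    columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2']).unstack().fillna(0).astype(int)

# unstack leaves a two-level column index ('Count', location);
# drop the outer level so the columns are just the location labels
df2.columns = df2.columns.droplevel(0)
print(df2)
```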

If you'd like to disregard multiplicities, replace Counter(gp.Customer_ID) with Counter(set(gp.Customer_ID)).

Upvotes: 1
