Optimizing compare row operation in pandas/python

Question

I have a large pandas dataframe that in simplified form looks like this:

import pandas as pd

names = ['P1', 'P2', 'P3']
clusters = [1, 1, 2]

df = pd.DataFrame(clusters, names).reset_index()
df.columns = ['names', 'cluster']
print(df)

  names  cluster
0    P1        1
1    P2        1
2    P3        2

I want to create a new dataframe or array, df_, which looks like the following:

names  P1  P2  P3
names            
P1      1   1   0
P2      1   1   0
P3      0   0   1

Here each cell value indicates whether the pair (P1/P2, P1/P3, P2/P3, etc.) has the same "cluster" value in the original dataframe (df).

I have been able to achieve this by brute force using the iterrows function:

# Start from an all-zero names x names frame
df_ = pd.DataFrame(0, index=df['names'], columns=df['names'])
for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        # Mark the pair when both rows share a cluster
        if row['cluster'] == row2['cluster']:
            df_.iloc[index, index2] += 1

But my actual data is much larger (2500 rows), so the nested loop runs 2500 × 2500 = 6.25 million comparisons, which makes this prohibitively slow. I know that vectorization or lambda functions would be preferable for performance reasons, but I am unsure how to start. Are there pandas functions I am not aware of that might be useful, or libraries other than pandas that are more amenable to this problem? Any hints would be much appreciated.

ALollz · Accepted Answer

You can merge the dataframe with itself on cluster and then use pd.crosstab:

import pandas as pd

m = df.merge(df, on='cluster')
pd.crosstab(m.names_x, m.names_y)

names_y  P1  P2  P3
names_x            
P1        1   1   0
P2        1   1   0
P3        0   0   1

If you only need a 0/1 indicator for each pairing instead of a count, append .clip(upper=1) to the crosstab result. Counts can exceed 1 only when a name appears more than once in df.
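
For reference, here is a minimal end-to-end sketch of the approach above on the sample data from the question. The names same_cluster and same_np are made up for illustration. The NumPy broadcasting comparison at the end is not part of this answer; it is an alternative shown only as a cross-check, under the assumption that a names x names matrix fits in memory.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"names": ["P1", "P2", "P3"], "cluster": [1, 1, 2]})

# Self-merge on cluster: one row per (names_x, names_y) pair sharing a cluster
m = df.merge(df, on="cluster")

# Count the pairs, then cap at 1 so every cell is a 0/1 indicator
same_cluster = pd.crosstab(m["names_x"], m["names_y"]).clip(upper=1)
print(same_cluster)

# Alternative (an assumption, not from the answer): compare the cluster
# column against itself with broadcasting to get the same matrix
c = df["cluster"].to_numpy()
same_np = pd.DataFrame(
    (c[:, None] == c[None, :]).astype(int),
    index=df["names"],
    columns=df["names"],
)
print(same_np)
```

Note that pd.crosstab sorts the row and column labels, while the broadcasting version keeps the original row order of df, so the two only line up directly when names is already sorted, as it is here.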
