Noelle Held
Noelle Held

Reputation: 33

Optimizing compare row operation in pandas/python

I have a large pandas dataframe that in simplified form looks like this:

names = ['P1', 'P2', 'P3']
clusters = [1, 1, 2]

df = pd.DataFrame(clusters, names).reset_index()
df.columns=['names', 'cluster']
print(df)

  names  cluster
0    P1        1
1    P2        1
2    P3        2

I want to create a new dataframe or array, df_, which looks like the following:

names  P1  P2  P3
names            
P1      1   1   0
P2      1   1   0
P3      0   0   1

Where the cell values indciate whether each pair (P1/P2, P1/P3, P2/P3, etc.) have the same "cluster" value in the original dataframe (df).

I have been able to achieve this by brute force using the iterrows function:

df_ = pd.DataFrame(index=df['names'], columns=df['names'])
df_ = df_.fillna(0)
for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        if row['cluster'] == row2['cluster']:
            df_.iloc[index, index2] += 1
        else: 
            continue

But my actual data is very large (2500 rows), which makes this prohibitively slow. I know that vectorization or lambda functions would be preferable for performance reasons, but I am unsure how to start, if there are pandas functions I am not aware of that might be useful, or if there are libraries other than pandas which might be more amenable to this problem. Any hints would be much appreciated.

Upvotes: 3

Views: 90

Answers (2)

g.a
g.a

Reputation: 347

I would use xarray for the task and take advantage of xarray's automatic broadcasting feature.

import pandas as pd

names = ['P1', 'P2', 'P3']
clusters = [1, 1, 2]

df = pd.DataFrame(clusters, names).reset_index()
df.columns=['names_x', 'cluster']
da_x = df.set_index('names_x')['cluster'].to_xarray()

df.columns=['names_y', 'cluster']
da_y = df.set_index('names_y')['cluster'].to_xarray()

da= (da_x == da_y).astype(int)
print(da.to_dataframe('X').unstack(['names_y'])['X'])

What is going on here?

We use two different versions of the DataFrame, one where the name of the index column is names_x and an other where it is names_y.

We select the single column, cluster, from the DataFrame (turning it into a pandas.Series) then apply the to_xarray method, which turns the series into xarray.DataArray-s.

Now we have two DataArrays: da_x, and da_y, with different dimensions (names_x and names_y). If we now do any binary operation on this dataset (e.g. ==) then xarray automatically expands each DataArray along the missing dimensions before applying the operation itself.

Finally we need to turn it back to DataFrame and "pivot" using the unstack method.

So most of the code is actually preparation, the "magic" happens here:

da_x == da_y

Upvotes: 0

ALollz
ALollz

Reputation: 59529

You can merge and then use .crosstab

import pandas as pd

m = df.merge(df, on='cluster')
pd.crosstab(m.names_x, m.names_y)

names_y  P1  P2  P3
names_x            
P1        1   1   0
P2        1   1   0
P3        0   0   1

If you need this to just be a boolean for the pairing instead of the count then add .clip(upper=1) to the end.

Upvotes: 2

Related Questions