Reputation: 33
I have a large pandas dataframe that in simplified form looks like this:
names = ['P1', 'P2', 'P3']
clusters = [1, 1, 2]
df = pd.DataFrame(clusters, names).reset_index()
df.columns=['names', 'cluster']
print(df)
names cluster
0 P1 1
1 P2 1
2 P3 2
I want to create a new dataframe or array, df_, which looks like the following:
names P1 P2 P3
names
P1 1 1 0
P2 1 1 0
P3 0 0 1
Where the cell values indciate whether each pair (P1/P2, P1/P3, P2/P3, etc.) have the same "cluster" value in the original dataframe (df).
I have been able to achieve this by brute force using the iterrows function:
df_ = pd.DataFrame(index=df['names'], columns=df['names'])
df_ = df_.fillna(0)
for index, row in df.iterrows():
for index2, row2 in df.iterrows():
if row['cluster'] == row2['cluster']:
df_.iloc[index, index2] += 1
else:
continue
But my actual data is very large (2500 rows), which makes this prohibitively slow. I know that vectorization or lambda functions would be preferable for performance reasons, but I am unsure how to start, if there are pandas functions I am not aware of that might be useful, or if there are libraries other than pandas which might be more amenable to this problem. Any hints would be much appreciated.
Upvotes: 3
Views: 90
Reputation: 347
I would use xarray for the task and take advantage of xarray's automatic broadcasting feature.
import pandas as pd
names = ['P1', 'P2', 'P3']
clusters = [1, 1, 2]
df = pd.DataFrame(clusters, names).reset_index()
df.columns=['names_x', 'cluster']
da_x = df.set_index('names_x')['cluster'].to_xarray()
df.columns=['names_y', 'cluster']
da_y = df.set_index('names_y')['cluster'].to_xarray()
da= (da_x == da_y).astype(int)
print(da.to_dataframe('X').unstack(['names_y'])['X'])
What is going on here?
We use two different versions of the DataFrame, one where the name of the index column is names_x
and an other where it is names_y
.
We select the single column, cluster
, from the DataFrame (turning it into a pandas.Series) then apply the to_xarray
method, which turns the series into xarray.DataArray-s.
Now we have two DataArrays: da_x
, and da_y
, with different dimensions (names_x
and names_y
).
If we now do any binary operation on this dataset (e.g. ==) then xarray automatically expands each DataArray along the missing dimensions before applying the operation itself.
Finally we need to turn it back to DataFrame and "pivot" using the unstack method.
So most of the code is actually preparation, the "magic" happens here:
da_x == da_y
Upvotes: 0
Reputation: 59529
You can merge
and then use .crosstab
import pandas as pd
m = df.merge(df, on='cluster')
pd.crosstab(m.names_x, m.names_y)
names_y P1 P2 P3
names_x
P1 1 1 0
P2 1 1 0
P3 0 0 1
If you need this to just be a boolean
for the pairing instead of the count then add .clip(upper=1)
to the end.
Upvotes: 2