Reputation: 2300
I would like to create a crosstab from a dataframe df
, comparing each record of df
to each other, i.e. pairwise, and calculate one number from several elements of the rows of df
. As an example, let's take the following dataframe and calculate the (squared) distance between the points:
import pandas as pd
df = pd.DataFrame({"Point": ["A", "B", "C"], "x": [10, 20, 30], "y": [1, 2, 3]})
df["XX"] = 1
result = (
df.merge(df, on="XX")
.assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
.loc[:, ["Point_x", "Point_y", "distance"]]
.pivot(index="Point_x", columns="Point_y")
)
yielding the desired result:
distance
Point_y A B C
Point_x
A 0 5 20
B 5 0 5
C 20 5 0
Is there a better way to do this without resorting to adding a dummy field XX
and merging on it? I tried multiple variations of
df = df.drop("XX", axis=1)
result = pd.crosstab(index=df["Point"], columns=df["Point"])
with values=
and aggfunc=
parameters, but to no avail. Possibly there is also an easier way using numpy?
Upvotes: 1
Views: 82
Reputation: 9941
As another option, using euclidean_distances
from sklearn
:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(
df[['x', 'y']],
df[['x', 'y']], squared=True)
Output:
array([[ 0., 101., 404.],
[101., 0., 101.],
[404., 101., 0.]])
Upvotes: 2
Reputation: 41327
Assuming 1.2.0+, you can avoid the dummy XX
column by merging with how="cross"
:
cross
: creates the cartesian product from both frames, preserves the order of the left keys (new in version 1.2.0)
(df.merge(df, how="cross")
.assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
.loc[:, ["Point_x", "Point_y", "distance"]]
.pivot(index="Point_x", columns="Point_y"))
# distance
# Point_y A B C
# Point_x
# A 0 101 404
# B 101 0 101
# C 404 101 0
You can do the pairwise calculations in numpy by using singleton dimensions (None
or np.newaxis
):
x = (df.x.values[:, None] - df.x.values) ** 2
y = (df.y.values[:, None] - df.y.values) ** 2
pd.DataFrame(x + y, index=df.Point, columns=df.Point)
# Point A B C
# Point
# A 0 101 404
# B 101 0 101
# C 404 101 0
If you compute a vector of pairwise values (e.g., result of pdist
), you can use squareform
to crosstab the vector:
from scipy.spatial.distance import squareform, pdist
pd.DataFrame(squareform(pdist(df[["x", "y"]]) ** 2), columns=df.Point, index=df.Point)
# Point A B C
# Point
# A 0.0 101.0 404.0
# B 101.0 0.0 101.0
# C 404.0 101.0 0.0
Upvotes: 4