Crosstab using multi-element calculation

Question

I would like to create a crosstab from a dataframe df, comparing each record of df to each other, i.e. pairwise, and calculate one number from several elements of the rows of df. As an example, let's take the following dataframe and calculate the (squared) distance between the points:

import pandas as pd

df = pd.DataFrame({"Point": ["A", "B", "C"], "x": [10, 20, 30], "y": [1, 2, 3]})

df["XX"] = 1

result = (
    df.merge(df, on="XX")
    .assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
    .loc[:, ["Point_x", "Point_y", "distance"]]
    .pivot(index="Point_x", columns="Point_y")
)

yielding the desired result:

        distance       
Point_y        A  B   C
Point_x                
A              0  5  20
B              5  0   5
C             20  5   0

Is there a better way to do this without resorting to adding a dummy field XX and merging on it? I tried multiple variations of

df = df.drop("XX", axis=1)

result = pd.crosstab(index=df["Point"], columns=df["Point"])

with values= and aggfunc= parameters, but to no avail. Possibly there is also an easier way using numpy?

tdy · Accepted Answer

"cross" merge

Assuming 1.2.0+, you can avoid the dummy XX column by merging with how="cross":

cross: creates the cartesian product from both frames, preserves the order of the left keys (new in version 1.2.0)

(df.merge(df, how="cross")
   .assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
   .loc[:, ["Point_x", "Point_y", "distance"]]
   .pivot(index="Point_x", columns="Point_y"))

#         distance          
# Point_y        A    B    C
# Point_x                   
# A              0  101  404
# B            101    0  101
# C            404  101    0

numpy broadcasting

You can do the pairwise calculations in numpy by using singleton dimensions (None or np.newaxis):

x = (df.x.values[:, None] - df.x.values) ** 2
y = (df.y.values[:, None] - df.y.values) ** 2
pd.DataFrame(x + y, index=df.Point, columns=df.Point)

# Point    A    B    C
# Point               
# A        0  101  404
# B      101    0  101
# C      404  101    0

scipy squareform

If you compute a vector of pairwise values (e.g., result of pdist), you can use squareform to crosstab the vector:

from scipy.spatial.distance import squareform, pdist
pd.DataFrame(squareform(pdist(df[["x", "y"]]) ** 2), columns=df.Point, index=df.Point)

# Point      A      B      C
# Point                     
# A        0.0  101.0  404.0
# B      101.0    0.0  101.0
# C      404.0  101.0    0.0

Crosstab using multi-element calculation

Answers (2)

"cross" merge

numpy broadcasting

scipy squareform

Related Questions