divingTobi
divingTobi

Reputation: 2300

Crosstab using multi-element calculation

I would like to create a crosstab from a dataframe df, comparing each record of df to each other, i.e. pairwise, and calculate one number from several elements of the rows of df. As an example, let's take the following dataframe and calculate the (squared) distance between the points:

import pandas as pd

df = pd.DataFrame({"Point": ["A", "B", "C"], "x": [10, 20, 30], "y": [1, 2, 3]})

df["XX"] = 1

result = (
    df.merge(df, on="XX")
    .assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
    .loc[:, ["Point_x", "Point_y", "distance"]]
    .pivot(index="Point_x", columns="Point_y")
)

yielding the desired result:

        distance       
Point_y        A  B   C
Point_x                
A              0  5  20
B              5  0   5
C             20  5   0

Is there a better way to do this without resorting to adding a dummy field XX and merging on it? I tried multiple variations of

df = df.drop("XX", axis=1)

result = pd.crosstab(index=df["Point"], columns=df["Point"])

with values= and aggfunc= parameters, but to no avail. Possibly there is also an easier way using numpy?

Upvotes: 1

Views: 82

Answers (2)

perl
perl

Reputation: 9941

As another option, using euclidean_distances from sklearn:

from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances(
    df[['x', 'y']],
    df[['x', 'y']], squared=True)

Output:

array([[  0., 101., 404.],
       [101.,   0., 101.],
       [404., 101.,   0.]])

Upvotes: 2

tdy
tdy

Reputation: 41327

"cross" merge

Assuming 1.2.0+, you can avoid the dummy XX column by merging with how="cross":

cross: creates the cartesian product from both frames, preserves the order of the left keys (new in version 1.2.0)

(df.merge(df, how="cross")
   .assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
   .loc[:, ["Point_x", "Point_y", "distance"]]
   .pivot(index="Point_x", columns="Point_y"))

#         distance          
# Point_y        A    B    C
# Point_x                   
# A              0  101  404
# B            101    0  101
# C            404  101    0

numpy broadcasting

You can do the pairwise calculations in numpy by using singleton dimensions (None or np.newaxis):

x = (df.x.values[:, None] - df.x.values) ** 2
y = (df.y.values[:, None] - df.y.values) ** 2
pd.DataFrame(x + y, index=df.Point, columns=df.Point)

# Point    A    B    C
# Point               
# A        0  101  404
# B      101    0  101
# C      404  101    0

scipy squareform

If you compute a vector of pairwise values (e.g., result of pdist), you can use squareform to crosstab the vector:

from scipy.spatial.distance import squareform, pdist
pd.DataFrame(squareform(pdist(df[["x", "y"]]) ** 2), columns=df.Point, index=df.Point)

# Point      A      B      C
# Point                     
# A        0.0  101.0  404.0
# B      101.0    0.0  101.0
# C      404.0  101.0    0.0

Upvotes: 4

Related Questions