Mamed

Reputation: 772

Hausdorff distance between rows of 2 columns

Given a data frame:

df = 

    car     lat    lon
0   0   22.0397 3.6531
1   1   22.0367 3.5095
2   2   22.0713 3.5346
3   3   22.1249 3.5922

I have calculated the Euclidean distance to get a distance matrix:

import pandas as pd
from scipy.spatial.distance import squareform, pdist

pd.DataFrame(squareform(pdist(df.iloc[:, 1:])), columns=df.car.unique(), index=df.car.unique())

Now I want to compute the Hausdorff distance and get the corresponding matrix.


I tried:

def hausdorff(p, q):
    p = p #Need to choose row
    q = q #Need to choose row
    return hausdorff_distance(p, q, distance="euclidean")

distance_df = squareform(pdist(df1.values, hausdorff))
euclidean = pd.DataFrame(distance_df)

Upvotes: 1

Views: 552

Answers (1)

Stef

Reputation: 30579

There's no need to choose rows yourself; pdist does this for you. It calls the user-supplied function for every pair of rows, so just pass the row vectors on to hausdorff_distance. The only caveat is that hausdorff_distance expects two 2-dimensional arrays as input, so you need to reshape the 1-D row vectors.

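import pandas as pd
from scipy.spatial.distance import pdist, squareform
# assumption: hausdorff_distance comes from the "hausdorff" package on PyPI
from hausdorff import hausdorff_distance

def hausdorff(p, q):
    # pdist passes 1-D rows; turn each into an (n_points, 2) array
    p = p.reshape(-1,2)
    q = q.reshape(-1,2)
    return hausdorff_distance(p, q, distance="euclidean")

pd.DataFrame(squareform(pdist(df.iloc[:, 1:], hausdorff)), columns=df.car.unique(), index=df.car.unique())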

Result:

          0         1         2         3
0  0.000000  0.143631  0.122641  0.104728
1  0.143631  0.000000  0.042745  0.120907
2  0.122641  0.042745  0.000000  0.078681
3  0.104728  0.120907  0.078681  0.000000
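
As a sanity check on the result above: when each set holds a single point, the Hausdorff distance reduces to the plain Euclidean distance between the two points, so this matrix is the same as the Euclidean one from the question. The sketch below illustrates this. It is only an illustration under some assumptions: it uses scipy's directed_hausdorff instead of hausdorff_distance, and the names points and hausdorff_rows are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, directed_hausdorff

# Coordinates from the question (lat, lon), one point per car.
points = np.array([
    [22.0397, 3.6531],
    [22.0367, 3.5095],
    [22.0713, 3.5346],
    [22.1249, 3.5922],
])

def hausdorff_rows(p, q):
    # Hypothetical helper: pdist passes two 1-D row vectors,
    # so reshape each into an (n_points, 2) array first.
    p = p.reshape(-1, 2)
    q = q.reshape(-1, 2)
    # Symmetric Hausdorff distance = max of the two directed distances.
    return max(directed_hausdorff(p, q)[0], directed_hausdorff(q, p)[0])

hausdorff_matrix = squareform(pdist(points, hausdorff_rows))
euclidean_matrix = squareform(pdist(points))  # default metric is Euclidean

# With one point per set, both matrices agree.
same = np.allclose(hausdorff_matrix, euclidean_matrix)
```

Here same should come out True, which is why the single-row case is mostly useful for checking that the callable is wired up correctly.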


The above just answers the technical question of how to use a user-defined function with pdist. Depending on what you're trying to achieve, I guess you'll need to supply arrays with more than one row, e.g. all rows for a given car, as in the following example:

import itertools as it
import numpy as np

# pd.np is no longer available in recent pandas versions, so use numpy directly
df1 = pd.DataFrame({'car': [0,0,1,1,2,2], 'lat': 22+np.random.rand(6), 'lon': 3+np.random.rand(6)})
#   car        lat       lon
#0    0  22.426797  3.006383
#1    0  22.894152  3.558360
#2    1  22.657756  3.969983
#3    1  22.788719  3.969007
#4    2  22.025103  3.854048
#5    2  22.867389  3.760920

cars = df1.car.unique()
p = []
for a, b in it.combinations(cars, 2):
    # all (lat, lon) points belonging to each car of the pair
    pts_a = df1.loc[df1.car == a, ['lat', 'lon']].to_numpy()
    pts_b = df1.loc[df1.car == b, ['lat', 'lon']].to_numpy()
    p.append(hausdorff_distance(pts_a, pts_b))
pd.DataFrame(squareform(p), columns=cars, index=cars)

Result:

          0         1         2
0  0.000000  0.990892  0.917975
1  0.990892  0.000000  0.643188
2  0.917975  0.643188  0.000000

Please note, however, that the directed Hausdorff distance is not symmetric, i.e. in general h(x,y) != h(y,x). hausdorff_distance returns the maximum of h(x,y) and h(y,x), so the symmetric matrix above can't tell you the individual directed distances. If you need those, you can use directed_hausdorff (from scipy.spatial.distance) to fill the full distance matrix in both directions.
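
A minimal sketch of such a directed matrix, assuming scipy's directed_hausdorff and a small made-up set of tracks (the tracks dictionary and its coordinates are hypothetical, chosen so the asymmetry is easy to see):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# Hypothetical point sets: one (n_points, 2) array of (lat, lon) per car.
tracks = {
    0: np.array([[22.0, 3.0], [22.0, 3.1]]),
    1: np.array([[22.0, 3.0], [22.0, 3.1], [22.0, 4.0]]),
}

cars = list(tracks)
n = len(cars)
directed = np.zeros((n, n))
for i, a in enumerate(cars):
    for j, b in enumerate(cars):
        if i != j:
            # directed_hausdorff returns (distance, index_in_u, index_in_v);
            # only the distance is needed here.
            directed[i, j] = directed_hausdorff(tracks[a], tracks[b])[0]

# The symmetric Hausdorff distance is the element-wise max of both directions.
symmetric = np.maximum(directed, directed.T)
```

Every point of car 0 also appears in car 1, so directed[0, 1] is 0, while car 1 has an extra point far from car 0, so directed[1, 0] is larger. The resulting arrays can be wrapped in pd.DataFrame(..., columns=cars, index=cars) in the same way as the matrices above.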

Upvotes: 2
