Darragh MacKenna

Reputation: 1440

Calculating and using Euclidean Distance in Python

I am trying to calculate the Euclidean distance between two datasets in Python. I can do this using the following:

np.linalg.norm(df-signal)

Here df and signal are my two datasets. This returns a single numerical value (e.g., 8258155.579535276), which is fine. My issue is that I want it to return the distance for each column of the dataset separately. Something like this:

AFNLWGT     4.867376e+10
AGI         3.769233e+09
EMCONTRB    1.202935e+07
FEDTAX      8.095078e+07
PTOTVAL     2.500056e+09
STATETAX    1.007451e+07
TAXINC      2.027124e+09
POTHVAL     1.158428e+08
INTVAL      1.606913e+07
PEARNVAL    2.038357e+09
FICA        1.080950e+07
WSALVAL     1.986075e+09
ERNVAL      1.905109e+09

I'm fairly new to Python so would really appreciate any help possible.

Upvotes: 1

Views: 204

Answers (1)

FBruzzesi

Reputation: 6475

To have the columnwise norm with column headers you can use pandas.DataFrame.aggregate together with np.linalg.norm:

import pandas as pd
import numpy as np

norms = (df-signal).aggregate(np.linalg.norm)

Notice that, by default, .aggregate operates along axis 0, so the function is applied to each column separately and the result is indexed by column name.
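As a small illustration of the column-wise behaviour, here is a minimal sketch on made-up data. The two frames below are hypothetical stand-ins for df and signal (only the column names AGI and FICA are borrowed from the question), chosen so the norms are easy to check by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's df and signal.
df = pd.DataFrame({"AGI": [3.0, 0.0], "FICA": [1.0, 2.0]})
signal = pd.DataFrame({"AGI": [0.0, 4.0], "FICA": [1.0, 2.0]})

# The difference is AGI: [3, -4], FICA: [0, 0]; the norm is taken per column.
norms = (df - signal).aggregate(np.linalg.norm)
print(norms)
```

The result is a Series with one entry per column: sqrt(3**2 + 4**2) for AGI and 0 for FICA, since the FICA columns are identical.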

However, this will be much slower than the pure NumPy implementation:

norms = pd.Series(np.linalg.norm(df.to_numpy()-signal.to_numpy(), axis=0), 
                  index=df.columns)

With test data of size 100x2, the latter is 20x faster.
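If you want to check this on your own machine, something along these lines should work. It is only a sketch: the random 100x2 frames, the column names a and b, and the helper names with_aggregate and with_numpy are all made up for illustration, and the timings will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical random test data of size 100x2 (not the question's datasets).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["a", "b"])
signal = pd.DataFrame(rng.normal(size=(100, 2)), columns=["a", "b"])


def with_aggregate():
    # pandas applies np.linalg.norm to each column in turn
    return (df - signal).aggregate(np.linalg.norm)


def with_numpy():
    # one vectorised call on the underlying arrays, reduced over rows
    diff = df.to_numpy() - signal.to_numpy()
    return pd.Series(np.linalg.norm(diff, axis=0), index=df.columns)


# Both approaches should give the same per-column norms.
assert np.allclose(with_aggregate(), with_numpy())

# The single value from the question is the norm over all entries, i.e. the
# square root of the sum of the squared per-column norms.
total = np.linalg.norm(df - signal)
assert np.isclose(total, np.sqrt((with_numpy() ** 2).sum()))

t_agg = timeit.timeit(with_aggregate, number=200)
t_np = timeit.timeit(with_numpy, number=200)
print(f"aggregate: {t_agg:.4f}s  numpy: {t_np:.4f}s")
```

One thing to keep in mind: df - signal aligns the two frames on index and column labels, whereas the to_numpy() version subtracts purely by position, so the two only agree when both frames have the same rows and columns in the same order.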

Upvotes: 2
