Mamed

Reputation: 772

Shortest way to run this code over the whole dataset

I have a very large DataFrame:

df.shape = (106, 3364)

I want to calculate the so-called Fréchet distance using this: Frechet Distance between 2 curves. It works well for a single pair of curves. Example:

x = df['1']
x1 = df['1.1']
p = np.array([x, x1])

y = df['2']
y1 = df['2.1']
q = np.array([y, y1])

P_final = list(zip(p[0], p[1]))
Q_final = list(zip(q[0], q[1]))

from frechetdist import frdist

frdist(P_final,Q_final)

But I cannot work out how to do this pair by pair for every pair of curves, like:

`1 and 1.1` to `1 and 1.1` which is equal to 0
`1 and 1.1` to `2 and 2.1` which is equal to some number
...
`1 and 1.1` to `1682 and 1682.1` which is equal to some number

I want to create something (my first idea is a for loop, but maybe you have a better solution) that calculates frdist(P_final, Q_final) between every such pair of curves.

In the end, I expect to get a matrix of size (106, 106) with 0 on the diagonal (because the distance from a curve to itself is 0):

matrix =

  0 1 2 3 4 5 ... 105
0 0
1   0
2     0
3       0  
4         0
5           0
...           0
105              0

I am not including my trial code because it confuses everyone!

EDITED: Sample data:

    1           1.1     2           2.1     3           3.1     4           4.1     5           5.1
0   43.1024     6.7498  45.1027     5.7500  45.1072     3.7568  45.1076     8.7563  42.1076     8.7563
1   46.0595     1.6829  45.0595     9.6829  45.0564     4.6820  45.0533     8.6796  42.0501     3.6775
2   25.0695     5.5454  44.9727     8.6660  41.9726     2.6666  84.9566     3.8484  44.9566     1.8484
3   35.0281     7.7525  45.0322     3.7465  14.0369     3.7463  62.0386     7.7549  65.0422     7.7599
4   35.0292     7.5616  45.0292     4.5616  23.0292     3.5616  45.0292     7.5616  25.0293     7.5613

Upvotes: 0

Views: 112

Answers (1)

Ric Hard

Reputation: 609

I used my own sample data, in your format (I hope):

import pandas as pd
from frechetdist import frdist
import numpy as np

# create sample data
df = pd.DataFrame([[1,2,3,4,5,6], [3,4,5,6,8,9], [2,3,4,5,2,2], [3,4,5,6,7,3]], columns=['1','1.1','2', '2.1', '3', '3.1'])

# this matrix will hold the result
n_curves = df.shape[1] // 2
res = np.zeros((n_curves, n_curves), dtype=np.float32)

for row in range(n_curves):
    for col in range(row, n_curves):

        # extract the two curves as lists of (x, y) points, one point per DataFrame row
        P = list(zip(df[f'{row+1}'], df[f'{row+1}.1']))
        Q = list(zip(df[f'{col+1}'], df[f'{col+1}.1']))

        # calculate distance
        dist = frdist(P, Q)

        # put result back (it's symmetric)
        res[row, col] = dist
        res[col, row] = dist

# output, rounded to three decimals
print(res.round(3))

Output:

[[0.    2.828 7.071]
 [2.828 0.    4.243]
 [7.071 4.243 0.   ]]

Hope that helps

EDIT: Some general tips:

  • If speed matters: check whether frdist also accepts a NumPy array of shape (n_values, 2). Then you could skip the rather expensive zip-and-unpack operation and pass the arrays directly, or build the data from the start in the format your library needs

  • Generally, use better column names (3 and 3.1 are not very obvious). Why not call them x3 and y3, or x3 and f_x3?

  • I would actually put the data into two different matrices. If you look at the code, I had to do some not-so-obvious things, like iterating over the shape divided by two and building indices with string operations, because of the given table layout
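
The layout tips above can be sketched roughly as follows. This is only an illustration under some assumptions: the x and y values are kept in two separate structures (here plain lists named xs and ys, which are hypothetical names), and discrete_frechet is a small dependency-free stand-in for frdist so that the snippet runs on its own. It is not the library's implementation. With pandas, xs and ys would be built from the x columns and the y columns respectively.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def discrete_frechet(p, q):
    """Discrete Frechet distance between two point sequences (iterative DP).

    Stand-in for frechetdist.frdist, used only to keep this sketch
    self-contained.
    """
    n, m = len(p), len(q)
    if n == 0 or m == 0:
        raise ValueError("input curves are empty")
    # ca[i][j] = coupling distance of the prefixes p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]


def pairwise_frechet(xs, ys):
    """Symmetric distance matrix; xs[k] and ys[k] hold the values of curve k."""
    curves = [list(zip(x, y)) for x, y in zip(xs, ys)]
    n = len(curves)
    res = [[0.0] * n for _ in range(n)]  # diagonal stays 0
    # each unordered pair is computed once and mirrored
    for i, j in combinations(range(n), 2):
        res[i][j] = res[j][i] = discrete_frechet(curves[i], curves[j])
    return res


# Hypothetical layout: one list of x values and one list of y values per curve.
# The numbers are those of the sample DataFrame above
# (x from columns '1', '2', '3'; y from columns '1.1', '2.1', '3.1').
xs = [[1, 3, 2, 3], [3, 5, 4, 5], [5, 8, 2, 7]]
ys = [[2, 4, 3, 4], [4, 6, 5, 6], [6, 9, 2, 3]]

matrix = pairwise_frechet(xs, ys)
for row in matrix:
    print([round(v, 3) for v in row])
```

With this layout no column names have to be assembled from strings, and the loop over combinations visits each pair of curves exactly once.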

Upvotes: 1
