Parsyk

Reputation: 331

Iterate over columns in df and calculate Euclidean distance with one column in pandas?

I have a dataset with several columns (time series) that I would like to synchronize, using 'col2' as the reference.

Here is an example with two time series:
enter image description here

Here is my df:
enter image description here

With the code below I am able to synchronize only two columns: 'col3' is aligned to 'col2' (the reference time series).

# pip install fastdtw
import numpy as np
import pandas as pd
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

# Example data: an ID column plus four random time series
df = pd.DataFrame({'ID': range(0, 25),
                   'col2': np.random.randn(25) + 3,
                   'col3': np.random.randn(25) + 3,
                   'col4': np.random.randn(25) + 3,
                   'col5': np.random.randn(25) + 3})

x = np.array(df['col2'].fillna(0))
y = np.array(df['col3'].fillna(0))

distance, path = fastdtw(x, y, dist=euclidean)

result = []

# Each path entry pairs an index in x (col2) with an index in y (col3)
for i, j in path:
    result.append([df['ID'].iloc[i],
                   df['col2'].iloc[i],
                   df['col3'].iloc[j]])
    
df_synchronized = pd.DataFrame(data=result,columns=['ID','col2','col3']).dropna()
df_synchronized = df_synchronized.drop_duplicates(subset=['ID'])
df_synchronized = df_synchronized.sort_values(by='ID')
df_synchronized = df_synchronized.reset_index(drop=True)
df_synchronized.head(n=3) 

Here is the df_synchronized:
enter image description here

I would like to iterate over all columns in the DataFrame and do for 'col4' and 'col5' what was done for 'col3'. In other words, 'col3' needs to be replaced in a loop by 'col4' and then by 'col5'. The goal is a df_synchronized that contains all columns from df.

Is there a way to do this?

distance, path = fastdtw(x, y, dist=euclidean)

can't be changed to distance, path = fastdtw(x, y, z, aa, dist=euclidean). The synchronization has to be done one column at a time: align one column, save it into df_synchronized, then continue with the next column.

Upvotes: 1

Views: 342

Answers (1)

user1673010

Reputation: 61

This can be done by picking one time series as a "reference", running distance, path = fastdtw(ref, x) for each of the other time series, and collecting the alignment path (path) from each run.

With all of these time series aligned to a common reference you can create a global alignment that allows a data point from any one of the time series to be matched to its corresponding data point in all of the other time series.
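A minimal sketch of that loop, under a few assumptions: 'col2' is the reference, the 'ID' column follows the row order as in the question, and for each reference row only the first matched point of the other series is kept (which is what drop_duplicates(subset=['ID']) does in the question's code). The helpers dtw_path and synchronize are hypothetical names, not part of fastdtw. dtw_path is a small exact DTW written in NumPy so the example runs without extra packages; it returns (distance, path) in the same shape as fastdtw, so a fastdtw call could be substituted for it.

```python
import numpy as np
import pandas as pd


def dtw_path(x, y):
    """Exact DTW for two 1-D sequences (stand-in for fastdtw).

    Returns (distance, path), where path is a list of (i, j) index pairs.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])

    # Walk back from the end of both sequences to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n, m], path


def synchronize(df, ref_col, id_col="ID"):
    """Align every other column to ref_col, one column at a time."""
    ref = df[ref_col].fillna(0).to_numpy()
    out = df[[id_col, ref_col]].copy()
    for col in df.columns.drop([id_col, ref_col]):
        _, path = dtw_path(ref, df[col].fillna(0).to_numpy())
        # Keep the first matched index per reference row.
        first_match = {}
        for i, j in path:
            first_match.setdefault(i, j)
        idx = [first_match[i] for i in range(len(ref))]
        out[col] = df[col].to_numpy()[idx]
    return out


rng = np.random.default_rng(0)
df = pd.DataFrame({"ID": range(25),
                   **{c: rng.normal(3, 1, 25) for c in ["col2", "col3", "col4", "col5"]}})
df_synchronized = synchronize(df, ref_col="col2")
print(df_synchronized.head(3))
```

Unlike the question's code, this sketch does not call dropna() on the result, so rows whose matched value is NaN are kept; add that call if those rows should be dropped.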

This will work very well as long as all of the time series are somewhat similar to each other. Ideally, though it is not required, the "reference" time series is the most average/normal one. The most "average" time series can be found by aligning each time series to all (or most) of the others: the one with the smallest average distance to the rest is the most "average" time series in the set.
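One way to pick such a reference can be sketched as follows. The function names dtw_distance and most_average and the toy data are illustrative assumptions, and the distance is a plain exact DTW in NumPy rather than fastdtw, whose returned distance could be used in its place.

```python
import numpy as np


def dtw_distance(x, y):
    """Exact DTW distance between two 1-D sequences (stand-in for fastdtw)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def most_average(series):
    """Return the name with the smallest mean DTW distance to all others."""
    names = list(series)
    mean_dist = {
        a: np.mean([dtw_distance(series[a], series[b]) for b in names if b != a])
        for a in names
    }
    return min(mean_dist, key=mean_dist.get), mean_dist


# Toy data: "a" and "b" are shifted copies of each other, "c" is offset by 5.
series = {
    "a": np.array([0.0, 1.0, 2.0, 1.0, 0.0]),
    "b": np.array([0.0, 0.0, 1.0, 2.0, 1.0]),
    "c": np.array([5.0, 6.0, 7.0, 6.0, 5.0]),
}
reference, mean_dist = most_average(series)
print(reference, mean_dist)
```

This compares every pair of series, so the cost grows quadratically with the number of columns; for many columns, comparing against a sample of the others is a possible shortcut.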

An example of this was performed in this paper. See section 6.2 for a description and page 104 has a picture showing the results of multiple time series aligned together. That paper took an extra step of "merging" the time series together after the global alignment.

Upvotes: 1
