Reputation: 129
I have a huge dataframe in a form
23/12/2020 15:38 23/12/2020 15:39 23/12/2020 15:40
1 [12,10] [15,10]
2 [52, 21] [14,7]
3 [1,0] [10,14]
Values in a dataframe correspond to geographical coordinates. I want to have a new dataframe which states the average speed between the dates and coordinates. I have written a function distance(wsp1, wsp2)
which takes as an input 2 lists of coordinates and return the distance between them. I also have a function which calculates a difference between two dates diff_between_dates(date1, date2)
. My question is how can I create a new dataframe which will have in the first column an average speed between the first 2 non empty cells in a row, in the second column an average speed between the second and third non empty cells in a row and so on. So in my example here the desired output would be:
1
1 distance([15,10],[12,10])/diff_between_dates(23/12/2020 15:40,23/12/2020 15:39)
2 distance([52,21],[14,7])/diff_between_dates(23/12/2020 15:40,23/12/2020 15:38)
3 distance([10,14],[1,0])/diff_between_dates(23/12/2020 15:39,23/12/2020 15:38)
Upvotes: 0
Views: 175
Reputation: 5601
import io
import pandas as pd
import numpy as np
# create sample df
df_str = '''
23/12/2020 15:38;23/12/2020 15:39;23/12/2020 15:40
;[12,10];[15,10]
[52, 21];;[14,7]
[1,0];[10,14];
'''
df = pd.read_csv(io.StringIO(df_str.strip()), sep=';')
df.index = [1,2,3]
df = df.applymap(lambda x: eval(x) if isinstance(x, str) else x)
print(df)
# 23/12/2020 15:38 23/12/2020 15:39 23/12/2020 15:40
# 1 NaN [12, 10] [15, 10]
# 2 [52, 21] NaN [14, 7]
# 3 [1, 0] [10, 14] NaN
# stack -> DataFrame to Series -> reset_index
dfn = df.stack().reset_index()
dfn.columns = ['idx', 'time', 'coor']
print(dfn)
# idx time coor
# 0 1 23/12/2020 15:39 [12, 10]
# 1 1 23/12/2020 15:40 [15, 10]
# 2 2 23/12/2020 15:38 [52, 21]
# 3 2 23/12/2020 15:40 [14, 7]
# 4 3 23/12/2020 15:38 [1, 0]
# 5 3 23/12/2020 15:39 [10, 14]
# convert datatime
dfn['time'] = pd.to_datetime(dfn['time'])
print(dfn)
# idx time coor
# 0 1 2020-12-23 15:39:00 [12, 10]
# 1 1 2020-12-23 15:40:00 [15, 10]
# 2 2 2020-12-23 15:38:00 [52, 21]
# 3 2 2020-12-23 15:40:00 [14, 7]
# 4 3 2020-12-23 15:38:00 [1, 0]
# 5 3 2020-12-23 15:39:00 [10, 14]
# sort values by idx and time
dfn.sort_values(['idx', 'time'], inplace=True)
# def sample function
def distance(x,y):
res = abs(x[0] - x[1]) ** 2 + abs(y[0] - y[1]) ** 2
res = np.sqrt(res)
return res
def diff_between_dates(x,y):
time_diff = abs(pd.to_datetime(x) - pd.to_datetime(y))
return time_diff.seconds
# calculate by split dfn by group using idx
res_dict = dict()
for idx, group in dfn.groupby('idx'):
# can modify by purpose
dates = group.iloc[:2, 1].tolist()
coors = group.iloc[:2, 2].tolist()
# calculate use function
dist_diff = distance(*coors)
dates_diff = diff_between_dates(*dates)
result = dist_diff/dates_diff
res_dict[idx] = result
obj_res = pd.Series(res_dict)
print(obj_res)
# 1 0.089753
# 2 0.264837
# 3 0.068718
# dtype: float64
Upvotes: 1