wychen
wychen

Reputation: 129

How to apply a function to the next non empty cell in a row in a dataframe

I have a huge dataframe in a form

          23/12/2020 15:38     23/12/2020 15:39   23/12/2020 15:40

1                                [12,10]           [15,10]              
2           [52, 21]                               [14,7]
3           [1,0]                [10,14]

Values in a dataframe correspond to geographical coordinates. I want to have a new dataframe which states the average speed between the dates and coordinates. I have written a function distance(wsp1, wsp2) which takes as an input 2 lists of coordinates and return the distance between them. I also have a function which calculates a difference between two dates diff_between_dates(date1, date2). My question is how can I create a new dataframe which will have in the first column an average speed between the first 2 non empty cells in a row, in the second column an average speed between the second and third non empty cells in a row and so on. So in my example here the desired output would be:

       1   
1    distance([15,10],[12,10])/diff_between_dates(23/12/2020 15:40,23/12/2020 15:39)
2    distance([52,21],[14,7])/diff_between_dates(23/12/2020 15:40,23/12/2020 15:38)
3    distance([10,14],[1,0])/diff_between_dates(23/12/2020 15:39,23/12/2020 15:38)

Upvotes: 0

Views: 175

Answers (1)

Ferris
Ferris

Reputation: 5601

import io
import pandas as pd
import numpy as np


# create sample df
df_str = '''
23/12/2020 15:38;23/12/2020 15:39;23/12/2020 15:40
;[12,10];[15,10]              
[52, 21];;[14,7]
[1,0];[10,14];
'''
df = pd.read_csv(io.StringIO(df_str.strip()), sep=';')
df.index = [1,2,3]
df = df.applymap(lambda x: eval(x) if isinstance(x, str) else x)
print(df)

#   23/12/2020 15:38 23/12/2020 15:39 23/12/2020 15:40
# 1              NaN         [12, 10]         [15, 10]
# 2         [52, 21]              NaN          [14, 7]
# 3           [1, 0]         [10, 14]              NaN

# stack -> DataFrame to Series -> reset_index
dfn = df.stack().reset_index()
dfn.columns = ['idx', 'time', 'coor']
print(dfn)

#    idx              time      coor
# 0    1  23/12/2020 15:39  [12, 10]
# 1    1  23/12/2020 15:40  [15, 10]
# 2    2  23/12/2020 15:38  [52, 21]
# 3    2  23/12/2020 15:40   [14, 7]
# 4    3  23/12/2020 15:38    [1, 0]
# 5    3  23/12/2020 15:39  [10, 14]

# convert datatime
dfn['time'] = pd.to_datetime(dfn['time'])

print(dfn)
#    idx                time      coor
# 0    1 2020-12-23 15:39:00  [12, 10]
# 1    1 2020-12-23 15:40:00  [15, 10]
# 2    2 2020-12-23 15:38:00  [52, 21]
# 3    2 2020-12-23 15:40:00   [14, 7]
# 4    3 2020-12-23 15:38:00    [1, 0]
# 5    3 2020-12-23 15:39:00  [10, 14]

# sort values by idx and time
dfn.sort_values(['idx', 'time'], inplace=True)

# def sample function
def distance(x,y):
    res = abs(x[0] - x[1]) ** 2 + abs(y[0] - y[1]) ** 2
    res = np.sqrt(res)
    return res

def diff_between_dates(x,y):
    time_diff = abs(pd.to_datetime(x) - pd.to_datetime(y))
    return time_diff.seconds


# calculate by split dfn by group using idx
res_dict = dict()
for idx, group in dfn.groupby('idx'):
    # can modify by purpose
    dates = group.iloc[:2, 1].tolist()
    coors = group.iloc[:2, 2].tolist()
    
    # calculate use function 
    dist_diff = distance(*coors)
    dates_diff = diff_between_dates(*dates)

    result = dist_diff/dates_diff
    res_dict[idx] = result

obj_res = pd.Series(res_dict)

print(obj_res)

# 1    0.089753
# 2    0.264837
# 3    0.068718
# dtype: float64

Upvotes: 1

Related Questions