Reputation: 281
I want to insert a new column called total
in final_df
which is a cumulative sum of value
in df
if it occurs between the times in final_df
. It sums the values if it occurs between the start
and end
in final_df
. So for example during the time range 01:30 to 02:00 in final_df
- both index 0 and 1 in df
occur between this time range so the total is 15 (10+5).
I have two pandas dataframes:
df
import pandas as pd
d = {'start_time': ['01:00','00:00','00:30','02:00'],
'end_time': ['02:00','03:00','01:30','02:30'],
'value': ['10','5','20','5']}
df = pd.DataFrame(data=d)
final_df
final_df = {'start_time': ['00:00, 00:30, 01:00, 01:30, 02:00, 02:30'],
'end_time': ['00:30, 01:00, 01:30, 02:00, 02:30, 03:00']}
final_df = pd.DataFrame(data=final_d)
output I want final_df
start_time end_time total
00:00 00:30 5
00:30 01:00 25
01:00 01:30 35
01:30 02:00 15
02:30 03:00 10
My try
final_df['total'] = final_df.apply(lambda x: df.loc[(df['start_time'] >= x.start_time) &
(df['end_time'] <= x.end_time), 'value'].sum(), axis=1)
Problem 1
I get the error: TypeError: ("'>=' not supported between instances of 'str' and 'datetime.time'", 'occurred at index 0')
I converted the relevant columns to datetime as follows:
df[['start_time','end_time']] = df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M')
final_df[['start_time','end_time']] = final_df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M:%S')
But I don't want to convert to datetime. Is there a way around this?
Problem 2
The sum is not working properly. It's only looking for exact match for the time range. So the output is:
start_time end_time total
00:00 00:30 0
00:30 01:00 0
01:00 01:30 0
01:30 02:00 0
02:30 03:00 5
Upvotes: 5
Views: 1356
Reputation: 29635
One way to not use apply
could be like this this.
df_ = (df.rename(columns={'start_time':1, 'end_time':-1}) #to use in the calculation later
.rename_axis(columns='mult') # mostly for esthetic
.set_index('value').stack() #reshape the data
.reset_index(name='time') # put the index back to columns
)
df_ = (df_.set_index(pd.to_datetime(df_['time'], format='%H:%M')) #to use resampling technic
.assign(total=lambda x: x['value'].astype(float)*x['mult']) #get plus or minus the value depending start/end
.resample('30T')[['total']].sum() # get the sum at the 30min bounds
.cumsum() #cumulative sum from the beginning
)
# create the column for merge with final resul
df_['start_time'] = df_.index.strftime('%H:%M')
# merge
final_df = final_df.merge(df_)
and you get
print (final_df)
start_time end_time total
0 00:00 00:30 5.0
1 00:30 01:00 25.0
2 01:00 01:30 35.0
3 01:30 02:00 15.0
4 02:00 02:30 10.0
5 02:30 03:00 5.0
But if you want to use apply, first you need to ensure that the columns are the good dtype and then you did the inegality in the reverse order like:
df['start_time'] = pd.to_datetime(df['start_time'], format='%H:%M')
df['end_time'] = pd.to_datetime(df['end_time'], format='%H:%M')
df['value'] = df['value'].astype(float)
final_df['start_time'] = pd.to_datetime(final_df['start_time'], format='%H:%M')
final_df['end_time'] = pd.to_datetime(final_df['end_time'], format='%H:%M')
final_df.apply(
lambda x: df.loc[(df['start_time'] <= x.start_time) & #see other inequality
(df['end_time'] >= x.end_time), 'value'].sum(), axis=1)
0 5.0
1 25.0
2 35.0
3 15.0
4 10.0
5 5.0
dtype: float64
Upvotes: 3