Reputation: 49
If I have a dataframe (df_data) like:
ID Time X Y Z H
05 2020-06-26 14:13:16 0.055 0.047 0.039 0.062
05 2020-06-26 14:13:21 0.063 0.063 0.055 0.079
05 2020-06-26 14:13:26 0.063 0.063 0.063 0.079
05 2020-06-26 14:13:31 0.095 0.102 0.079 0.127
... .. ... ... ... ... ... ...
01 2020-07-01 08:59:43 0.063 0.063 0.047 0.079
01 2020-07-01 08:59:48 0.055 0.055 0.055 0.079
01 2020-07-01 08:59:53 0.071 0.063 0.055 0.082
01 2020-07-01 08:59:58 0.063 0.063 0.047 0.082
01 2020-07-01 08:59:59 0.047 0.047 0.047 0.071
[17308709 rows x 8 columns]
which I want to filter by another dataframe of intervals (df_intervals), like:
int_id start end
1 2020-02-03 18:11:59 2020-02-03 18:42:00
2 2020-02-03 19:36:59 2020-02-03 20:06:59
3 2020-02-03 21:00:59 2020-02-03 21:31:00
4 2020-02-03 22:38:00 2020-02-03 23:08:00
5 2020-02-04 05:55:00 2020-02-04 06:24:59
... ... ...
1804 2021-01-10 13:50:00 2021-01-10 14:20:00
1805 2021-01-10 18:10:00 2021-01-10 18:40:00
1806 2021-01-10 19:40:00 2021-01-10 20:10:00
1807 2021-01-10 21:25:00 2021-01-10 21:55:00
1808 2021-01-10 22:53:00 2021-01-10 23:23:00
[1808 rows x 2 columns]
what is the most efficient way to do so? I have a large dataset and if I try to iterate over it like:
for i in range(len(df_intervals)):
    df_filtered = df_data[df_data['Time'].between(df_intervals['start'][i], df_intervals['end'][i])]
    ...
    ...
    ...
it takes forever! I know I shouldn't iterate over large dataframes, but I have no idea how else I could filter it by every interval in the second dataframe.
The steps I'm trying to do are:
1- Get all the intervals (start/end columns) from df_intervals;
2- Use those intervals to create a new dataframe (df_stats) containing the statistics of the columns within those time ranges. Example:
start end ID X_max X_min X_mean Y_max Y_min Y_mean ....
2020-02-03 18:11:59 2020-02-03 18:42:00 01 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 02 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 03 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 04 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 05 ... ... ... ... ... ... ... ...
2020-02-03 19:36:59 2020-02-03 20:06:59 01 ... ... ... ... ... ... ... ...
2020-02-03 19:36:59 2020-02-03 20:06:59 02 ... ... ... ... ... ... ... ...
2020-02-03 19:36:59 2020-02-03 20:06:59 03 ... ... ... ... ... ... ... ...
Upvotes: 3
Views: 1216
Reputation: 211
If Joe's answer does not give you the speed you desire, I think it can be improved by eliminating the stats calculation inside your for loop. (I'm stealing his df creation because he is a hero for putting that in his answer.) Ideally, you could get rid of the for loop too, but with the timestamp indices being duplicated (across the ID numbers) it could be tricky to merge these two dataframes.
Here is my attempt, still using iteration to handle the start/end times. First, I apply the int_id to the parent df. I add it to the parent dataframe so I can groupby on it without making a 'temp' dataframe and computing my statistics on that.
for index, row in df2.iterrows():
    df1.loc[df1.Time.between(row.start, row.end), 'int_id'] = row.int_id
ID Time X Y Z H int_id
0 01 2020-02-03 18:13:16 0.011 0.012 0.013 0.014 1.0
1 01 2020-02-03 18:13:21 0.015 0.016 0.017 0.018 1.0
2 01 2020-02-03 18:13:26 0.013 0.013 0.013 0.013 1.0
3 01 2020-02-03 18:13:31 0.015 0.015 0.015 0.015 1.0
4 02 2020-02-03 18:13:16 0.021 0.022 0.023 0.024 1.0
5 02 2020-02-03 18:13:21 0.025 0.026 0.027 0.028 1.0
6 02 2020-02-03 18:13:26 0.023 0.023 0.023 0.023 1.0
Then I define the aggregations to get it all in one go AFTER the loop is finished.
aggs = {'X': ['sum', 'max', 'mean', 'median'],
        'Y': ['sum', 'max', 'mean', 'median'],
        'Z': ['sum', 'max', 'mean', 'median'],
        'H': ['sum', 'max', 'mean', 'median']}
df_final = df1.groupby(by='int_id').agg(aggs)
X Y Z H
sum max mean median sum max mean median sum max mean median sum max mean median
int_id
1.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
2.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
Note: here you have a MultiIndex on the columns. You can join the levels with the following.
df_final.columns = ['_'.join(col).strip() for col in df_final.columns.values]
X_sum X_max X_mean X_median Y_sum Y_max Y_mean Y_median Z_sum Z_max Z_mean Z_median H_sum H_max H_mean H_median
int_id
1.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
2.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
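On the point above about ideally getting rid of the for loop too: here is a minimal sketch (my addition, not part of the original answer) of labelling every Time with its int_id in one vectorized step using pd.IntervalIndex, assuming the start/end intervals in df2 never overlap.
# Sketch: vectorized int_id assignment; assumes the intervals in df2 do not overlap.
iv = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
pos = iv.get_indexer(df1['Time'])   # -1 where Time falls in no interval
inside = pos >= 0
df1.loc[inside, 'int_id'] = df2['int_id'].to_numpy()[pos[inside]]
After that, the same groupby/agg shown above should work unchanged.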
Upvotes: 2
Reputation: 8508
Here's the full code to get this done. I have tried to create some sample data to see if this works. Please run this against your full dataset and see if this provides you the results you need.
Step 1: Create a temp list to store the interim dataframes.
temp_list = []
Step 2: Iterate through dataframe 2. For each row, do the following:
Filter the rows of dataframe 1 between the start and end date:
temp = df1[df1.Time.between(row.start, row.end)]
Group by ID and get the statistical values for X, Y, Z, and H, one set for each column:
x = temp.groupby('ID')['X'].agg(['min', 'max', 'mean', 'median']).add_prefix('X_').reset_index()
Merge the X, Y, Z, and H results into a single dataframe.
Add the start and end date to the merged dataframe.
Append the merged dataframe to the temp_list.
Step 3: Create the final dataframe from the temp_list.
Step 4: Rearrange the columns per your needs: start and end date as the first two columns, then ID, then the X values, Y values, Z values, and finally the H values.
Step 5: Print the dataframe.
Full Code to get this done:
c1 = ['ID','Time','X','Y','Z','H']
d1 = [
['01','2020-02-03 18:13:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 18:13:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 18:13:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 18:13:31',0.015,0.015,0.015,0.015],
['02','2020-02-03 18:13:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 18:13:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 18:13:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 18:13:31',0.025,0.025,0.025,0.025],
['03','2020-02-03 18:13:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 18:13:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 18:13:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 18:13:31',0.035,0.035,0.035,0.035],
['04','2020-02-03 18:13:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 18:13:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 18:13:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 18:13:31',0.045,0.045,0.045,0.045],
['05','2020-02-03 18:13:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 18:13:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 18:13:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 18:13:31',0.095,0.102,0.079,0.127],
['01','2020-02-03 20:03:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 20:03:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 20:03:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 20:03:31',0.015,0.015,0.015,0.015],
['02','2020-02-03 20:03:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 20:03:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 20:03:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 20:03:31',0.025,0.025,0.025,0.025],
['03','2020-02-03 20:03:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 20:03:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 20:03:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 20:03:31',0.035,0.035,0.035,0.035],
['04','2020-02-03 20:03:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 20:03:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 20:03:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 20:03:31',0.045,0.045,0.045,0.045],
['05','2020-02-03 20:03:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 20:03:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 20:03:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 20:03:31',0.095,0.102,0.079,0.127],
['01','2020-07-01 08:59:43',0.063,0.063,0.047,0.079],
['01','2020-07-01 08:59:48',0.055,0.055,0.055,0.079],
['01','2020-07-01 08:59:53',0.071,0.063,0.055,0.082],
['01','2020-07-01 08:59:58',0.063,0.063,0.047,0.082],
['01','2020-07-01 08:59:59',0.047,0.047,0.047,0.071]]
import pandas as pd
df1 = pd.DataFrame(d1,columns=c1)
df1.Time = pd.to_datetime(df1.Time)
c2 = ['int_id','start','end']
d2 = [[1,'2020-02-03 18:11:59','2020-02-03 18:42:00'],
[2,'2020-02-03 19:36:59','2020-02-03 20:06:59'],
[3,'2020-02-03 21:00:59','2020-02-03 21:31:00'],
[4,'2020-02-03 22:38:00','2020-02-03 23:08:00'],
[5,'2020-02-04 05:55:00','2020-02-04 06:24:59'],
[1804,'2021-01-10 13:50:00','2021-01-10 14:20:00'],
[1805,'2021-01-10 18:10:00','2021-01-10 18:40:00'],
[1806,'2021-01-10 19:40:00','2021-01-10 20:10:00'],
[1807,'2021-01-10 21:25:00','2021-01-10 21:55:00'],
[1808,'2021-01-10 22:53:00','2021-01-10 23:23:00']]
from functools import reduce
df2 = pd.DataFrame(d2,columns=c2)
df2.start = pd.to_datetime(df2.start)
df2.end = pd.to_datetime(df2.end)
temp_list = []
for i, row in df2.iterrows():
    temp = df1[df1.Time.between(row.start, row.end)]
    x = temp.groupby('ID')['X'].agg(['min','max','mean','median']).add_prefix('X_').reset_index()
    y = temp.groupby('ID')['Y'].agg(['min','max','mean','median']).add_prefix('Y_').reset_index()
    z = temp.groupby('ID')['Z'].agg(['min','max','mean','median']).add_prefix('Z_').reset_index()
    h = temp.groupby('ID')['H'].agg(['min','max','mean','median']).add_prefix('H_').reset_index()
    data_frames = [x, y, z, h]
    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['ID'],
                                                    how='outer'), data_frames).fillna('void')
    df_merged['start'] = row.start
    df_merged['end'] = row.end
    temp_list.append(df_merged)
df_final = pd.concat(temp_list, ignore_index=True)
column_names = ['start','end','ID',
'X_min','X_max','X_mean','X_median',
'Y_min','Y_max','Y_mean','Y_median',
'Z_min','Z_max','Z_mean','Z_median',
'H_min','H_max','H_mean','H_median']
df_final = df_final[column_names]
print (df_final)
The output of this will be:
start end ID ... H_max H_mean H_median
0 2020-02-03 18:11:59 2020-02-03 18:42:00 01 ... 0.018 0.01500 0.0145
1 2020-02-03 18:11:59 2020-02-03 18:42:00 02 ... 0.028 0.02500 0.0245
2 2020-02-03 18:11:59 2020-02-03 18:42:00 03 ... 0.038 0.03500 0.0345
3 2020-02-03 18:11:59 2020-02-03 18:42:00 04 ... 0.048 0.04500 0.0445
4 2020-02-03 18:11:59 2020-02-03 18:42:00 05 ... 0.127 0.08675 0.0790
5 2020-02-03 19:36:59 2020-02-03 20:06:59 01 ... 0.018 0.01500 0.0145
6 2020-02-03 19:36:59 2020-02-03 20:06:59 02 ... 0.028 0.02500 0.0245
7 2020-02-03 19:36:59 2020-02-03 20:06:59 03 ... 0.038 0.03500 0.0345
8 2020-02-03 19:36:59 2020-02-03 20:06:59 04 ... 0.048 0.04500 0.0445
9 2020-02-03 19:36:59 2020-02-03 20:06:59 05 ... 0.127 0.08675 0.0790
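As a side note (my suggestion, only checked against the sample data above, not the full 17M-row dataset): the four per-column groupbys and the reduce merge inside the loop could likely be collapsed into a single multi-column aggregation, for example:
temp_list = []
for i, row in df2.iterrows():
    temp = df1[df1.Time.between(row.start, row.end)]
    # Aggregate all four columns at once; the result has a MultiIndex on the columns.
    stats = temp.groupby('ID')[['X', 'Y', 'Z', 'H']].agg(['min', 'max', 'mean', 'median'])
    stats.columns = ['_'.join(col) for col in stats.columns]   # X_min, X_max, ...
    stats = stats.reset_index()
    stats.insert(0, 'start', row.start)
    stats.insert(1, 'end', row.end)
    temp_list.append(stats)
df_final = pd.concat(temp_list, ignore_index=True)
This keeps start, end, and ID as the first three columns, so the final column reordering becomes optional.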
Upvotes: 1