Reputation: 49
If I have a dataframe (df_data) like:
ID Time X Y Z H
05 2020-06-26 14:13:16 0.055 0.047 0.039 0.062
05 2020-06-26 14:13:21 0.063 0.063 0.055 0.079
05 2020-06-26 14:13:26 0.063 0.063 0.063 0.079
05 2020-06-26 14:13:31 0.095 0.102 0.079 0.127
... .. ... ... ... ... ... ...
01 2020-07-01 08:59:43 0.063 0.063 0.047 0.079
01 2020-07-01 08:59:48 0.055 0.055 0.055 0.079
01 2020-07-01 08:59:53 0.071 0.063 0.055 0.082
01 2020-07-01 08:59:58 0.063 0.063 0.047 0.082
01 2020-07-01 08:59:59 0.047 0.047 0.047 0.071
[17308709 rows x 8 columns]
which I want to filter by another dataframe of intervals (df_intervals), like:
int_id start end
1 2020-02-03 18:11:59 2020-02-03 18:42:00
2 2020-02-03 19:36:59 2020-02-03 20:06:59
3 2020-02-03 21:00:59 2020-02-03 21:31:00
4 2020-02-03 22:38:00 2020-02-03 23:08:00
5 2020-02-04 05:55:00 2020-02-04 06:24:59
... ... ...
1804 2021-01-10 13:50:00 2021-01-10 14:20:00
1805 2021-01-10 18:10:00 2021-01-10 18:40:00
1806 2021-01-10 19:40:00 2021-01-10 20:10:00
1807 2021-01-10 21:25:00 2021-01-10 21:55:00
1808 2021-01-10 22:53:00 2021-01-10 23:23:00
[1808 rows x 2 columns]
what is the most efficient way to do so? I have a large dataset and if I try to iterate over it like:
for i in range(len(df_intervals)):
    df_filtered = df_data[df_data['Time'].between(df_intervals['start'][i], df_intervals['end'][i])]
    ...
    ...
    ...
it takes forever! I know I shouldn't iterate over large dataframes, but I have no idea how else I could filter it by every interval in the second dataframe.
The steps I'm trying to do are:
1- Get all the intervals (start/end columns) from df_intervals;
2- Use those intervals to create a new dataframe (df_stats) containing the statistics of the columns within those time ranges. Example:
start end ID X_max X_min X_mean Y_max Y_min Y_mean ....
2020-02-03 18:11:59 2020-02-03 18:42:00 01 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 02 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 03 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 04 ... ... ... ... ... ... ... ...
2020-02-03 18:11:59 2020-02-03 18:42:00 05 ... ... ... ... ... ... ... ...
2020-02-03 19:36:59 2020-02-03 20:06:59 01 ... ... ... ... ... ... ... ...
2020-02-03 19:36:59 2020-02-03 20:06:59 02 ... ... ... ... ... ... ... ...
2020-02-03 19:36:59 2020-02-03 20:06:59 03 ... ... ... ... ... ... ... ...
Upvotes: 3
Views: 1216
Reputation: 211
If Joe's answer does not give you the speed you desire, I think it can be improved by eliminating the stats calculation inside your for loop. (I'm stealing his df creation because he is a hero for putting that in his answer.) Ideally, you could get rid of the for loop too, but with the timestamp indices being duplicated (across the ID numbers) it could be tricky to merge these two dataframes.
Here is my attempt, still using iteration to handle the start/end times. First, I apply the int_id to the parent df. I add it to the parent dataframe so I can groupby on it without making a 'temp' dataframe and computing my statistics on that.
for index, row in df2.iterrows():
    df1.loc[df1.Time.between(row.start, row.end), 'int_id'] = row.int_id
ID Time X Y Z H int_id
0 01 2020-02-03 18:13:16 0.011 0.012 0.013 0.014 1.0
1 01 2020-02-03 18:13:21 0.015 0.016 0.017 0.018 1.0
2 01 2020-02-03 18:13:26 0.013 0.013 0.013 0.013 1.0
3 01 2020-02-03 18:13:31 0.015 0.015 0.015 0.015 1.0
4 02 2020-02-03 18:13:16 0.021 0.022 0.023 0.024 1.0
5 02 2020-02-03 18:13:21 0.025 0.026 0.027 0.028 1.0
6 02 2020-02-03 18:13:26 0.023 0.023 0.023 0.023 1.0
Then I define the aggregations to get it all in one go AFTER the loop is finished.
aggs = {'X': ['sum', 'max', 'mean', 'median'],
        'Y': ['sum', 'max', 'mean', 'median'],
        'Z': ['sum', 'max', 'mean', 'median'],
        'H': ['sum', 'max', 'mean', 'median']}
df_final = df1.groupby(by='int_id').agg(aggs)
X Y Z H
sum max mean median sum max mean median sum max mean median sum max mean median
int_id
1.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
2.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
Note: here you have a MultiIndex on the columns. You can join the levels with the following.
df_final.columns = ['_'.join(col).strip() for col in df_final.columns.values]
X_sum X_max X_mean X_median Y_sum Y_max Y_mean Y_median Z_sum Z_max Z_mean Z_median H_sum H_max H_mean H_median
int_id
1.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
2.0 0.732 0.095 0.0366 0.034 0.739 0.102 0.03695 0.034 0.708 0.079 0.0354 0.034 0.827 0.127 0.04135 0.0345
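On the point above about ideally getting rid of the for loop too: here is a minimal sketch (my addition, not part of the original answer) of labelling every Time with its int_id in one vectorized step using pd.IntervalIndex, assuming the start/end intervals in df2 never overlap.
# Sketch: vectorized int_id assignment; assumes the intervals in df2 do not overlap.
iv = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
pos = iv.get_indexer(df1['Time'])   # -1 where Time falls in no interval
inside = pos >= 0
df1.loc[inside, 'int_id'] = df2['int_id'].to_numpy()[pos[inside]]
After that, the same groupby/agg shown above should work unchanged.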
Upvotes: 2
Reputation: 8508
Here's the full code to get this done. I have tried to create some sample data to see if this works. Please run this against your full dataset and see if this provides you the results you need.
Step 1: Create a temp list to store the interim dataframes.
temp_list = []
Step 2: Iterate through dataframe 2. For each row, do the following:
Filter the rows of dataframe 1 between the start and end date:
temp = df1[df1.Time.between(row.start, row.end)]
Group by ID and get the statistical values for X, Y, Z, and H, one set for each column:
x = temp.groupby('ID')['X'].agg(['min', 'max', 'mean', 'median']).add_prefix('X_').reset_index()
Merge the X, Y, Z, and H results into a single dataframe.
Add the start and end date to the merged dataframe.
Append the merged dataframe to the temp_list.
Step 3: Create the final dataframe from the temp_list.
Step 4: Rearrange the columns per your needs: start and end date as the first two columns, then ID, then the X values, Y values, Z values, and finally the H values.
Step 5: Print the dataframe.
Full Code to get this done:
c1 = ['ID','Time','X','Y','Z','H']
d1 = [
['01','2020-02-03 18:13:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 18:13:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 18:13:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 18:13:31',0.015,0.015,0.015,0.015],
['02','2020-02-03 18:13:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 18:13:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 18:13:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 18:13:31',0.025,0.025,0.025,0.025],
['03','2020-02-03 18:13:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 18:13:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 18:13:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 18:13:31',0.035,0.035,0.035,0.035],
['04','2020-02-03 18:13:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 18:13:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 18:13:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 18:13:31',0.045,0.045,0.045,0.045],
['05','2020-02-03 18:13:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 18:13:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 18:13:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 18:13:31',0.095,0.102,0.079,0.127],
['01','2020-02-03 20:03:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 20:03:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 20:03:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 20:03:31',0.015,0.015,0.015,0.015],
['02','2020-02-03 20:03:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 20:03:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 20:03:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 20:03:31',0.025,0.025,0.025,0.025],
['03','2020-02-03 20:03:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 20:03:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 20:03:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 20:03:31',0.035,0.035,0.035,0.035],
['04','2020-02-03 20:03:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 20:03:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 20:03:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 20:03:31',0.045,0.045,0.045,0.045],
['05','2020-02-03 20:03:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 20:03:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 20:03:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 20:03:31',0.095,0.102,0.079,0.127],
['01','2020-07-01 08:59:43',0.063,0.063,0.047,0.079],
['01','2020-07-01 08:59:48',0.055,0.055,0.055,0.079],
['01','2020-07-01 08:59:53',0.071,0.063,0.055,0.082],
['01','2020-07-01 08:59:58',0.063,0.063,0.047,0.082],
['01','2020-07-01 08:59:59',0.047,0.047,0.047,0.071]]
import pandas as pd
df1 = pd.DataFrame(d1,columns=c1)
df1.Time = pd.to_datetime(df1.Time)
c2 = ['int_id','start','end']
d2 = [[1,'2020-02-03 18:11:59','2020-02-03 18:42:00'],
[2,'2020-02-03 19:36:59','2020-02-03 20:06:59'],
[3,'2020-02-03 21:00:59','2020-02-03 21:31:00'],
[4,'2020-02-03 22:38:00','2020-02-03 23:08:00'],
[5,'2020-02-04 05:55:00','2020-02-04 06:24:59'],
[1804,'2021-01-10 13:50:00','2021-01-10 14:20:00'],
[1805,'2021-01-10 18:10:00','2021-01-10 18:40:00'],
[1806,'2021-01-10 19:40:00','2021-01-10 20:10:00'],
[1807,'2021-01-10 21:25:00','2021-01-10 21:55:00'],
[1808,'2021-01-10 22:53:00','2021-01-10 23:23:00']]
from functools import reduce
df2 = pd.DataFrame(d2,columns=c2)
df2.start = pd.to_datetime(df2.start)
df2.end = pd.to_datetime(df2.end)
temp_list = []
for i, row in df2.iterrows():
    temp = df1[df1.Time.between(row.start, row.end)]
    x = temp.groupby('ID')['X'].agg(['min','max','mean','median']).add_prefix('X_').reset_index()
    y = temp.groupby('ID')['Y'].agg(['min','max','mean','median']).add_prefix('Y_').reset_index()
    z = temp.groupby('ID')['Z'].agg(['min','max','mean','median']).add_prefix('Z_').reset_index()
    h = temp.groupby('ID')['H'].agg(['min','max','mean','median']).add_prefix('H_').reset_index()
    data_frames = [x, y, z, h]
    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['ID'],
                                                    how='outer'), data_frames).fillna('void')
    df_merged['start'] = row.start
    df_merged['end'] = row.end
    temp_list.append(df_merged)
df_final = pd.concat(temp_list, ignore_index=True)
column_names = ['start','end','ID',
'X_min','X_max','X_mean','X_median',
'Y_min','Y_max','Y_mean','Y_median',
'Z_min','Z_max','Z_mean','Z_median',
'H_min','H_max','H_mean','H_median']
df_final = df_final[column_names]
print (df_final)
The output of this will be:
start end ID ... H_max H_mean H_median
0 2020-02-03 18:11:59 2020-02-03 18:42:00 01 ... 0.018 0.01500 0.0145
1 2020-02-03 18:11:59 2020-02-03 18:42:00 02 ... 0.028 0.02500 0.0245
2 2020-02-03 18:11:59 2020-02-03 18:42:00 03 ... 0.038 0.03500 0.0345
3 2020-02-03 18:11:59 2020-02-03 18:42:00 04 ... 0.048 0.04500 0.0445
4 2020-02-03 18:11:59 2020-02-03 18:42:00 05 ... 0.127 0.08675 0.0790
5 2020-02-03 19:36:59 2020-02-03 20:06:59 01 ... 0.018 0.01500 0.0145
6 2020-02-03 19:36:59 2020-02-03 20:06:59 02 ... 0.028 0.02500 0.0245
7 2020-02-03 19:36:59 2020-02-03 20:06:59 03 ... 0.038 0.03500 0.0345
8 2020-02-03 19:36:59 2020-02-03 20:06:59 04 ... 0.048 0.04500 0.0445
9 2020-02-03 19:36:59 2020-02-03 20:06:59 05 ... 0.127 0.08675 0.0790
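As a side note (my suggestion, only checked against the sample data above, not the full 17M-row dataset): the four per-column groupbys and the reduce merge inside the loop could likely be collapsed into a single multi-column aggregation, for example:
temp_list = []
for i, row in df2.iterrows():
    temp = df1[df1.Time.between(row.start, row.end)]
    # Aggregate all four columns at once; the result has a MultiIndex on the columns.
    stats = temp.groupby('ID')[['X', 'Y', 'Z', 'H']].agg(['min', 'max', 'mean', 'median'])
    stats.columns = ['_'.join(col) for col in stats.columns]   # X_min, X_max, ...
    stats = stats.reset_index()
    stats.insert(0, 'start', row.start)
    stats.insert(1, 'end', row.end)
    temp_list.append(stats)
df_final = pd.concat(temp_list, ignore_index=True)
This keeps start, end, and ID as the first three columns, so the final column reordering becomes optional.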
Upvotes: 1