Subset df using timestamps - Python

Question

I'm hoping to subset a df using specific timestamps plus an additional period of time. Using below, df contains specific timestamps that I want to use to subset df2. Essentially, I use the timestamps in df and determine the previous minute. These periods of time are then used to create individual df's, which are concatenated together to create the final df.

However, this is inefficient by itself, but becomes even more so when dealing with multiple times.

import pandas as pd

df = pd.DataFrame({   
        'Time' : ['2020-08-02 10:01:12.5','2020-08-02 11:01:12.5','2020-08-02 12:31:00.0','2020-08-02 12:41:22.6'],             
        'ID' : ['X','Y','B','X'],                 
    })

# 1 min before timestamp
'2020-08-02 10:00:12.5' 
# first timestamp
'2020-08-02 10:01:12.5' 

# 1 min before timestamp
'2020-08-02 11:00:02.1' 
# second timestamp
'2020-08-02 11:01:02.1' 

 df2 = pd.DataFrame({   
        'Time' : ['2020-08-02 10:00:00.1','2020-08-02 10:00:00.2','2020-08-02 10:00:00.3','2020-08-02 10:00:00.4'],             
        'ID' : ['','','',''],                 
    })

d1 = df2[(df2['Time'] > '2020-08-02 10:00:12.5') & (df2['Time'] <= '2020-08-02 10:01:12.5')]
d2 = df2[(df2['Time'] > '2020-08-02 11:00:02.1') & (df2['Time'] <= '2020-08-02 11:01:02.1')]

df_out = pd.concat([d1,d2])#...include all separate periods of time

Intended Output:

                    Time ID
   2020-08-02 10:00:12.5  
   2020-08-02 10:00:12.6  
...
   2020-08-02 11:01:12.5  X
   2020-08-02 11:00:02.1
   2020-08-02 11:00:02.2
...
   2020-08-02 11:01:02.1  Y

perl · Accepted Answer

There's the merge_asof method in pandas that does just that.

Let me use slightly different timestamps compared to the original post to make it a bit easier to illustrate. I'll set df1 timestamps at 10:01, 10:03 and 10:06 for the purpose of this example.

Let's add 1MinBefore column to df with the timestamp one minute before the Time (we'll use it later to merge the dataframes):

df = pd.DataFrame({   
    'Time' : ['2020-08-02 10:01:00','2020-08-02 10:03:00','2020-08-02 10:06:00'],
    'ID' : ['X','Y','Z'],                 
})
df['Time'] = pd.to_datetime(df['Time'])
df['1MinBefore'] = df['Time'] - pd.Timedelta('1min')

So our df is:

                 Time ID          1MinBefore
0 2020-08-02 10:01:00  X 2020-08-02 10:00:00
1 2020-08-02 10:03:00  Y 2020-08-02 10:02:00
2 2020-08-02 10:06:00  Z 2020-08-02 10:05:00

Let's use the range between 10:00 and 10:07 with 30 second intervals for df2:

df2 = pd.DataFrame({   
    'Time' : pd.date_range(
        start='2020-08-02 10:00:00',
        end='2020-08-02 10:07:00',
        freq='30s'),
    'ID' : '',
})

And now the key step, merging these dataframes with merge_asof:

pd.merge_asof(df2[['Time']], df[['ID', '1MinBefore']],
              left_on='Time', right_on='1MinBefore',
              tolerance=pd.Timedelta('1min')

Output:

                  Time   ID          1MinBefore
0  2020-08-02 10:00:00    X 2020-08-02 10:00:00
1  2020-08-02 10:00:30    X 2020-08-02 10:00:00
2  2020-08-02 10:01:00    X 2020-08-02 10:00:00
3  2020-08-02 10:01:30  NaN                 NaT
4  2020-08-02 10:02:00    Y 2020-08-02 10:02:00
5  2020-08-02 10:02:30    Y 2020-08-02 10:02:00
6  2020-08-02 10:03:00    Y 2020-08-02 10:02:00
7  2020-08-02 10:03:30  NaN                 NaT
8  2020-08-02 10:04:00  NaN                 NaT
9  2020-08-02 10:04:30  NaN                 NaT
10 2020-08-02 10:05:00    Z 2020-08-02 10:05:00
11 2020-08-02 10:05:30    Z 2020-08-02 10:05:00
12 2020-08-02 10:06:00    Z 2020-08-02 10:05:00
13 2020-08-02 10:06:30  NaN                 NaT
14 2020-08-02 10:07:00  NaN                 NaT

The tolerance parameter of 1 minute basically tells it that values in df 'older' than 1 minute should be disregarded.

Now we can of course drop 1MinBefore column and use fillna on the ID column to make it look exactly like the Intended Output in the original post.

Subset df using timestamps - Python

Answers (1)

Related Questions