mubas007

Reputation: 131

How to merge a row to its closest row (based on timestamp) in Pandas?

I am relatively new to Pandas, so my apologies if the question is not framed properly. I have the following data set:

                       t  HVAC1_S1  HVAC2_S1  ...  HVAC4_S1  HVAC5_S1  HVAC6_S1
0       2009-08-18 18:12     711.0       0.0  ...       0.0       0.0       0.0
1       2009-08-18 18:14     705.0     734.0  ...       0.0     725.0     711.0
2       2009-08-18 18:15       0.0       0.0  ...     730.0       0.0       0.0
3       2009-08-18 18:29       0.0     721.0  ...     716.0     718.0     712.0
4       2009-08-18 18:30     705.0       0.0  ...       0.0       0.0       0.0
                 ...       ...       ...  ...       ...       ...       ...
156033  2012-11-09 21:59     714.0     720.0  ...     739.0       0.0     727.0
156034  2012-11-09 22:00       0.0       0.0  ...       0.0     743.0       0.0
156035  2012-11-09 22:14     723.0     729.0  ...     734.0     743.0     732.0
156036  2012-11-09 22:29     718.0     732.0  ...       0.0     739.0     725.0
156037  2012-11-09 22:30       0.0       0.0  ...     739.0       0.0       0.0

Notice the timestamps. For example, at index 1 (2009-08-18 18:14) five readings were taken (one is hidden by the ...). There are actually 6 readings in total, but the last one was recorded in the next minute, at 2009-08-18 18:15. This is a bit confusing because the timestamps have no seconds. All 6 readings are taken together and then measured again 15 minutes later, but one value often lands in the next minute and therefore ends up in a new row.

I want to set a tolerance of 1 minute, merge the values of such rows (for example at index 2, 4, 156034 and 156037) into the preceding row, and delete the merged row from the DataFrame.

For example, I would like to obtain this DataFrame:

                       t  HVAC1_S1  HVAC2_S1  ...  HVAC4_S1  HVAC5_S1  HVAC6_S1
0       2009-08-18 18:12     711.0       0.0  ...       0.0       0.0       0.0
1       2009-08-18 18:14     705.0     734.0  ...     730.0     725.0     711.0
2       2009-08-18 18:29     705.0     721.0  ...     716.0     718.0     712.0
                 ...       ...       ...  ...       ...       ...       ...
156033  2012-11-09 21:59     714.0     720.0  ...     739.0     743.0     727.0
156035  2012-11-09 22:14     723.0     729.0  ...     734.0     743.0     732.0
156036  2012-11-09 22:29     718.0     732.0  ...     739.0     739.0     725.0

Note: index 0 can be ignored because it is outside the 1-minute tolerance.

Please share your suggestions. Thank you!

Upvotes: 1

Views: 570

Answers (1)

ALollz

Reputation: 59549

You can use an asof merge (pd.merge_asof) to merge the DataFrame with itself. Pick the forward direction and disallow exact matches (otherwise every row would merge with itself), with a tolerance of 1 minute. Bring the index along in the merge so that rows which were merged into another row can be removed afterwards.

After the merge, every value column appears twice, once with an _x suffix and once with a _y suffix. Strip the suffixes and group along the columns axis, summing each pair to combine them. Summing works here because a missing reading is stored as 0.0.

Input Data:

                    t  HVAC1_S1  HVAC2_S1  HVAC4_S1  HVAC5_S1  HVAC6_S1
0 2009-08-18 18:12:00     711.0       0.0       0.0       0.0       0.0
1 2009-08-18 18:14:00     705.0     734.0       0.0     725.0     711.0
2 2009-08-18 18:15:00       0.0       0.0     730.0       0.0       0.0
3 2009-08-18 18:29:00       0.0     721.0     716.0     718.0     712.0
4 2009-08-18 18:30:00     705.0       0.0       0.0       0.0       0.0
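
To run the code below, the sample can be rebuilt roughly like this. It is only a sketch of the five rows shown above; the column hidden behind the ... in the question is left out, and the name df is what the code below expects.

```python
import pandas as pd

# Five sample rows copied from the table above; 't' is parsed to datetime
# because merge_asof needs a sorted datetime (or numeric) key.
df = pd.DataFrame({
    "t": pd.to_datetime([
        "2009-08-18 18:12", "2009-08-18 18:14", "2009-08-18 18:15",
        "2009-08-18 18:29", "2009-08-18 18:30",
    ]),
    "HVAC1_S1": [711.0, 705.0, 0.0, 0.0, 705.0],
    "HVAC2_S1": [0.0, 734.0, 0.0, 721.0, 0.0],
    "HVAC4_S1": [0.0, 0.0, 730.0, 716.0, 0.0],
    "HVAC5_S1": [0.0, 725.0, 0.0, 718.0, 0.0],
    "HVAC6_S1": [0.0, 711.0, 0.0, 712.0, 0.0],
})
```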

Code:

import pandas as pd
# df['t'] = pd.to_datetime(df['t'])  # uncomment if 't' is not already datetime
# merge_asof also requires df to be sorted by 't'

res = pd.merge_asof(df.reset_index(), df.reset_index(), 
                    on='t', 
                    direction='forward', 
                    allow_exact_matches=False,
                    tolerance=pd.Timedelta('1min'))
res = res.set_index('t')

res = res[~res.index_x.isin(res.index_y)]  #Remove rows that merged with others
res = res.drop(columns=['index_x', 'index_y'])

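# Strip the _x/_y suffix and sum each pair of columns.
# (Transposing avoids groupby(axis=1), which is deprecated in recent pandas.)
res = res.T.groupby(res.columns.str.rsplit('_', n=1).str[0]).sum().T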

Output:

                     HVAC1_S1  HVAC2_S1  HVAC4_S1  HVAC5_S1  HVAC6_S1
t                                                                    
2009-08-18 18:12:00     711.0       0.0       0.0       0.0       0.0
2009-08-18 18:14:00     705.0     734.0     730.0     725.0     711.0
2009-08-18 18:29:00     705.0     721.0     716.0     718.0     712.0
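
As an alternative to the self-merge, the same result on this sample can be obtained by labelling runs of rows that are at most 1 minute apart and summing within each run. This is a different technique from the asof merge above, shown only as a sketch: it assumes 't' is sorted, that 0.0 means "no reading", and the names new_group, group_id and merged are made up for illustration. Note that it chains: three rows each 1 minute apart would all collapse into one row, which may or may not be what you want.

```python
import pandas as pd

df = pd.DataFrame({
    "t": pd.to_datetime([
        "2009-08-18 18:12", "2009-08-18 18:14", "2009-08-18 18:15",
        "2009-08-18 18:29", "2009-08-18 18:30",
    ]),
    "HVAC1_S1": [711.0, 705.0, 0.0, 0.0, 705.0],
    "HVAC2_S1": [0.0, 734.0, 0.0, 721.0, 0.0],
    "HVAC4_S1": [0.0, 0.0, 730.0, 716.0, 0.0],
    "HVAC5_S1": [0.0, 725.0, 0.0, 718.0, 0.0],
    "HVAC6_S1": [0.0, 711.0, 0.0, 712.0, 0.0],
})

# A row opens a new group when it is more than 1 minute after the previous row.
# The first row's diff is NaT, which compares as False, so it falls in group 0.
new_group = df["t"].diff() > pd.Timedelta("1min")
group_id = new_group.cumsum().rename("group")

# Keep the earliest timestamp of each group and sum the readings.
value_cols = df.columns.drop("t")
agg = {"t": "first", **{col: "sum" for col in value_cols}}
merged = df.groupby(group_id).agg(agg).set_index("t")

print(merged)
```

On the five sample rows this yields three rows (18:12, 18:14 and 18:29), matching the output above.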

Upvotes: 1
