connor449
connor449

Reputation: 1679

How to merge/aggregate data within range of floats that is closest to the values of another column in pandas

I want to merge 2 dataframes (df_a and df_b) on the shared column 'time_stamp'. Both contain floats, however 'time_stamp' in df_a increments by 1, and 'time_stamp' in df_b increments by .33. I want to merge them and aggregate all values of the df_b columns (by max) that fall within df_a time_stamp intervals. Please advise.

df_a


time_stamp  data
0.0         b
1.0         b
2.0         c
3.0         a


df_b

time_stamp   data_x   data_y
0.33         1          0
0.66         0          0
0.99         1          0
1.32         1          3
1.65         1          0
1.98         0          0
2.31         0          0
2.64         0          0
2.97         0          0

This is an example of the desired result:

time_stamp  data  data_x  data_y
0.0          b     1        0
1.0          b     1        3
2.0          c     0        0
3.0          a    N/A      N/A

Upvotes: 0

Views: 189

Answers (1)

Luiz Fernando Lobo
Luiz Fernando Lobo

Reputation: 704

First define the step that you want.

step = 1 

Create a cut, using the step

 max_time = df['time_stamp'].max()
 #Adding the step to the max_time or it will ignore the last value
 cutp = pd.cut(df["time_stamp"], np.arange(0,max_time+step,step))

Filter

df_grouped = df.groupby(cutp).max()

Result

            time_stamp  data_x  data_y
time_stamp
(0.0, 1.0]        0.99       1       0
(1.0, 2.0]        1.98       1       3
(2.0, 3.0]        2.97       0       0

Setting the data frame the way it should.

df_grouped = df_grouped.drop(columns=['time_stamp']).reset_index()
df_grouped['time_stamp'] = df_grouped['time_stamp'].apply(lambda val: val.left)

Result after treatment:

   time_stamp  data_x  data_y
0         0.0       1       0
1         1.0       1       3
2         2.0       0       0

Merging then together

dfa.merge(df_grouped,on='time_stamp',how='left')

Result

   time_stamp data  data_x  data_y
0         0.0    b     1.0     0.0
1         1.0    b     1.0     3.0
2         2.0    c     0.0     0.0
3         3.0    a     NaN     NaN

This seems kind of overwork but this solution will get the max of both columns data_x and data_y something that merge_asof won't, by my tests it only gets the first column (in the case data_x).

Upvotes: 1

Related Questions