Reputation: 1679
I have two dataframes that I want to combine. They look as follows:
df_1
unit start_time stop_time
A 0.0 1.2
B 1.3 4.1
A 4.2 4.5
B 4.6 7.2
A 7.3 8.0
df_2
time other_data
0.2 .0122
0.4 .0128
0.6 .0101
0.8 .0091
1.0 .2122
1.2 .1542
1.4 .1546
1.6 .1522
1.8 .2542
2.0 .1557
2.2 .2542
2.4 .1543
2.6 .0121
2.8 .0111
3.0 .0412
3.2 .0214
3.4 .0155
3.6 .0159
3.8 .0154
4.0 .0155
4.2 .0211
4.4 .0265
4.6 .0146
4.8 .0112
5.0 .0166
5.2 .0101
5.4 .0132
5.6 .0112
5.8 .0121
6.0 .0142
6.2 .0124
6.4 .0111
6.6 .0123
6.8 .0111
6.0 .0119
6.2 .0112
6.4 .0131
6.6 .0117
6.8 .0172
7.0 .0123
7.2 .0127
7.4 .0121
7.6 .0110
7.8 .0120
8.0 .0121
I want to merge these data frames using the following criteria:
I want to group all values from df_2.other_data where df_2.time is between df_1.start_time and df_1.stop_time. For example, for the first row of df_1, the following data from df_2 would be grouped:
time other_data
0.2 .0122
0.4 .0128
0.6 .0101
0.8 .0091
1.0 .2122
1.2 .1542
Within this group, I want to count the total number of observations where df_2.other_data is above a threshold, which in this case will be set to .0120. The total number of observations in this group above this threshold is 4. This is the value I want to merge onto df_1. The result should look like this:
unit start_time stop_time other_data_above_threshold
A 0.0 1.2 4
The final dataframe should look like this:
unit start_time stop_time other_data_above_threshold
A 0.0 1.2 4
B 1.3 4.1 13
A 4.2 4.5 3
B 4.6 7.2 11
A 7.3 8.0 4
Upvotes: 2
Views: 65
Reputation: 25259
You may try use pd.cut
a = df_1.start_time.to_list() + [np.inf]
s = pd.cut(df_2.time, bins=a, labels=df_1.index, right=False)
df_1['other_data_above_threshold'] = df_2.other_data.gt(0.012).groupby(s).sum()
Out[213]:
unit start_time stop_time other_data_above_threshold
0 A 0.0 1.2 4.0
1 B 1.3 4.1 13.0
2 A 4.2 4.5 2.0
3 B 4.6 7.2 11.0
4 A 7.3 8.0 2.0
Upvotes: 0
Reputation: 8033
IIUC, this is what you need.
df['other_data_at'] = df.apply(lambda x: df2.loc[(df2['time']>= x['start_time']) & (df2['time']<= x['stop_time'])].loc[df2['other_data']>=0.012].count()[0], axis=1)
Output
unit start_time stop_time other_data_at
0 A 0.0 1.2 4
1 B 1.3 4.1 13
2 A 4.2 4.5 2 #you expected output shows 3 but it should be 2
3 B 4.6 7.2 11
4 A 7.3 8.0 3
Upvotes: 1
Reputation: 91
Hi I would try to loop over your df1 and use it's values for df2
That would look a little like this:
def my_counting(df1, df2, threshold):
count_list = ()
for index,unit in enumerate(df['unit']):
df = df2[(df2['time'] >= df1['start_time'][index]) & (df2['time'] < df1['stop_time'][index])]
count_list.append(df[df['other_data'] <= threshold].shape[0])
df1['other_data_above_threshold'] = count_list
return df1
print(my_counting(df1, df2, 0.012)
Upvotes: 0