Reputation: 1679
I am looking to merge two data frames that look as below:
df_1
unit start stop
A 0.0 8.15
B 9.18 11.98
A 13.07 13.80
B 13.82 15.00
A 16.46 17.58
df_2
time other_data
1 5
2 5
3 6
4 10
5 5
6 2
7 1
8 5
9 5
10 7
11 5
12 5
13 5
14 10
15 5
16 4
17 4
18 4
I want to merge df_2.other_data onto df_1. However, I first want to group df_2.other_data by the rows of df_1, so that each group holds the values whose df_2.time is between that row's df_1.start and df_1.stop.
From each group I want to extract the following number based on the pseudo-code below:
for other_data between start_time and stop_time:
    if 1 or 2 in other_data:
        value = other_data.min()
    elif 10 in other_data and 1 or 2 not in other_data:
        value = other_data.max()
    elif other_data.mode() > 1:
        value = other_data.min()
    else:
        value = other_data.mode()
For the first example group [5, 5, 6], the value would be 5, the mode. For the second example group [6, 10, 5, 2], the value would be 2.
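The pseudo-code above can be turned into a runnable function. This is only a sketch of one possible reading: the name `pick_value` is hypothetical, and it assumes that `other_data.mode() > 1` means "more than one value ties for the mode".

```python
import pandas as pd

def pick_value(values: pd.Series):
    """Reduce one group of other_data to a single value (assumed reading of the rules)."""
    modes = values.mode()              # all most-frequent values, sorted ascending
    if values.isin([1, 2]).any():      # rule 1: a 1 or a 2 is present
        return values.min()
    if (values == 10).any():           # rule 2: a 10 is present (and no 1 or 2)
        return values.max()
    if len(modes) > 1:                 # rule 3: several values tie for the mode (assumption)
        return values.min()
    return modes.iloc[0]               # rule 4: the single mode

print(pick_value(pd.Series([5, 5, 6])))      # 5, the mode
print(pick_value(pd.Series([6, 10, 5, 2])))  # 2, because a 2 is present
```

Note that Python evaluates `1 or 2 in other_data` as `1 or (2 in other_data)`, which is always truthy, so the membership test has to be written out explicitly as above.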
This should result in a df that looks like this:
unit start stop other_data
A 0.0 8.15 1
B 9.18 11.98 5
A 13.07 13.80 5
B 13.82 15.00 10
A 16.46 17.58 4
However, with the code below I get this result:
m = np.greater_equal.outer(df_2.time.values, df_1.start.values) & \
    np.less_equal.outer(df_2.time.values, df_1.stop.values)
def logic(s):
    if s.isin([1,2]).any(): return s.min()
    if s.isin([10]).any(): return s.max()
    return s.mode()[0]
df_1['other_data'] = df_2.other_data.groupby(m.cumsum(1).argmax(1)).agg(logic)
start stop other_data
0 7.53 8.15 1
1 9.18 11.98 5
2 13.07 13.80 NaN
3 13.82 15.00 10
4 16.46 17.58 NaN
Upvotes: 1
Views: 50
Reputation: 59264
I would definitely avoid the loops and go with a vectorized solution.
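import numpy as np
# m[i, j] is True when df2.time[i] lies inside interval j of df
m = np.greater_equal.outer(df2.time.values, df.start_time.values) & \
    np.less_equal.outer(df2.time.values, df.stop_time.values)
def logic(s):
    if s.isin([1,2]).any(): return s.min()
    if s.isin([10]).any(): return s.max()
    return s.mode().iloc[0]  # mode() returns a Series; take its first (smallest) entry
# label each df2 row with the index of the first interval that contains it
df['other_data'] = df2.other_data.groupby(m.cumsum(1).argmax(1)).agg(logic)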
unit start_time stop_time other_data
0 A 0.0 3.2 5
1 B 3.3 6.1 2
2 A 6.2 10.5 1
3 B 10.6 13.2 5
4 A 13.3 16.0 10
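One caveat with labelling rows via `argmax`: a row of the time frame that falls inside no interval has an all-False row in `m`, so `argmax` returns 0 and that row is silently counted in the first group. The sketch below is not the code above but a variant that drops unmatched rows before grouping. It assumes the `df_1`/`df_2` frames from the question with columns `start` and `stop`; the names `matched` and `labels` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    "unit": ["A", "B", "A", "B", "A"],
    "start": [0.0, 9.18, 13.07, 13.82, 16.46],
    "stop": [8.15, 11.98, 13.80, 15.00, 17.58],
})
df_2 = pd.DataFrame({
    "time": range(1, 19),
    "other_data": [5, 5, 6, 10, 5, 2, 1, 5, 5, 7, 5, 5, 5, 10, 5, 4, 4, 4],
})

# m[i, j] is True when df_2.time[i] lies inside interval j of df_1
t = df_2["time"].to_numpy()
m = (t[:, None] >= df_1["start"].to_numpy()) & (t[:, None] <= df_1["stop"].to_numpy())

matched = m.any(axis=1)      # rows that fall inside at least one interval
labels = m.argmax(axis=1)    # index of the first matching interval (0 if none!)

def logic(s):
    if s.isin([1, 2]).any():
        return s.min()
    if s.isin([10]).any():
        return s.max()
    return s.mode().iloc[0]

# Group only the matched rows, then align the result with df_1's index so that
# intervals without any sample become NaN instead of shifting other rows.
result = df_2.loc[matched, "other_data"].groupby(labels[matched]).agg(logic)
df_1["other_data"] = result.reindex(df_1.index)
print(df_1)
```

With integer sample times, the interval from 13.07 to 13.80 contains no sample at all, so its row stays NaN under this approach; a value there would have to come from some additional rule.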
Upvotes: 1
Reputation: 386
You can create a while loop that adds values into each row until df_2.time is out of range.
First, create a list:
other_data = []
Then do something like this:
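# Assumes both frames are sorted by time, have the default 0..n-1 index,
# and that the intervals do not overlap.
i = 0
j = 0
while i < len(df_1):
    current_row = []
    # skip samples that lie before this interval (e.g. in a gap between intervals)
    while j < len(df_2) and df_2["time"][j] <= df_1["start_time"][i]:
        j += 1
    # collect samples strictly inside (start_time, stop_time)
    while j < len(df_2) and df_2["time"][j] < df_1["stop_time"][i]:
        current_row.append(df_2["other_data"][j])
        j += 1
    other_data.append(current_row)
    i += 1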
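You can then add other_data to your dataframe as a new column. Note that the inner lists can have different lengths, so reduce each list to a single value first if you need a regular numpy array.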
You may want to change the conditions of the range as you prefer.
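As an alternative way to control the range conditions, the intervals can be described with `pd.IntervalIndex`, whose `closed` argument selects inclusive or exclusive bounds. This is a swapped-in technique, not the loop above, and only a sketch: it assumes the question's frames with columns `start` and `stop`, non-overlapping intervals, and a hypothetical output column named `samples`.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "unit": ["A", "B", "A", "B", "A"],
    "start": [0.0, 9.18, 13.07, 13.82, 16.46],
    "stop": [8.15, 11.98, 13.80, 15.00, 17.58],
})
df_2 = pd.DataFrame({
    "time": range(1, 19),
    "other_data": [5, 5, 6, 10, 5, 2, 1, 5, 5, 7, 5, 5, 5, 10, 5, 4, 4, 4],
})

# closed can be "both", "left", "right" or "neither" to change the range conditions
intervals = pd.IntervalIndex.from_arrays(df_1["start"], df_1["stop"], closed="both")

# Position of the interval containing each sample, or -1 if it lies in a gap
pos = intervals.get_indexer(df_2["time"].to_numpy(dtype=float))
inside = pos >= 0

# Collect the samples of each interval into a list, aligned with df_1's rows
groups = df_2.loc[inside, "other_data"].groupby(pos[inside]).agg(list)
df_1["samples"] = groups.reindex(df_1.index)
print(df_1)
```

Intervals that contain no sample end up as NaN in `samples`, so a reduction function applied afterwards should handle that case explicitly.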
Upvotes: 0