connor449

Reputation: 1679

Grouping values from one column that fall between other column values plus conditional statement

I am looking to merge the two data frames shown below:

df_1
unit      start    stop
A      0.0    8.15
B      9.18   11.98
A     13.07   13.80
B     13.82   15.00
A     16.46   17.58


df_2
time    other_data
1            5
2            5
3            6
4            10
5            5
6            2
7            1
8            5
9            5
10           7
11           5
12           5
13           5
14           10
15           5
16           4
17           4
18           4

I want to merge df_2.other_data onto df_1. However, I want to group df_2.other_data where df_2.time falls between df_1.start and df_1.stop.

From each group I want to extract the following number based on the pseudo-code below:

for each group of other_data between start and stop:
    if 1 in other_data or 2 in other_data:
        value = other_data.min()
    elif 10 in other_data:
        value = other_data.max()
    elif other_data has no unique mode:
        value = other_data.min()
    else:
        value = other_data.mode()

For the first example group [5, 5, 6], the value would be 5, the mode. For the second example group [6, 10, 5, 2], the value would be 2.
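
The pseudo-code above can be sketched as a small plain-Python function (the helper name `pick_value` is mine, and I am reading the `mode` condition as "the group has no unique mode", which matches both worked examples):

```python
from statistics import multimode

def pick_value(other_data):
    """Pick one value from a group, following the pseudo-code above."""
    if 1 in other_data or 2 in other_data:
        return min(other_data)        # a 1 or 2 is present -> minimum
    if 10 in other_data:
        return max(other_data)        # a 10 but no 1/2 -> maximum
    modes = multimode(other_data)
    if len(modes) > 1:                # no unique mode -> fall back to minimum
        return min(other_data)
    return modes[0]                   # otherwise the (unique) mode

print(pick_value([5, 5, 6]))      # 5, the mode
print(pick_value([6, 10, 5, 2]))  # 2, since a 2 is present
```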

This should result in a df that looks like this:

unit      start    stop  other_data
A      0.0    8.15         1   
B      9.18   11.98        5
A     13.07   13.80        5
B     13.82   15.00        10
A     16.46   17.58        4

However, with the code below I get this result:

m = np.greater_equal.outer(df_2.time.values, df_1.start.values) & \
    np.less_equal.outer(df_2.time.values, df_1.stop.values)

def logic(s):
    if s.isin([1,2]).any(): return s.min()
    if s.isin([10]).any(): return s.max()
    return s.mode()[0]

df_1['other_data'] = df_2.other_data.groupby(m.cumsum(1).argmax(1)).agg(logic)


      start    stop other_data
0      7.53    8.15          1
1      9.18   11.98          5
2     13.07   13.80        NaN
3     13.82   15.00         10
4     16.46   17.58        NaN
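
One likely cause of the NaN and shifted rows: `argmax(1)` assigns group 0 to every df_2 row that falls inside no interval, so stray times pollute the grouping. A sketch that first drops uncovered times, using the question's data (note the third interval contains no whole-number time at all, so that row genuinely stays NaN):

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'unit': list('ABABA'),
                     'start': [0.0, 9.18, 13.07, 13.82, 16.46],
                     'stop':  [8.15, 11.98, 13.80, 15.00, 17.58]})
df_2 = pd.DataFrame({'time': range(1, 19),
                     'other_data': [5, 5, 6, 10, 5, 2, 1, 5, 5, 7,
                                    5, 5, 5, 10, 5, 4, 4, 4]})

# m[i, j] is True when df_2.time[i] lies inside df_1's j-th interval
m = np.greater_equal.outer(df_2.time.values, df_1.start.values) & \
    np.less_equal.outer(df_2.time.values, df_1.stop.values)

in_any = m.any(1)                      # times covered by at least one interval
group = m.cumsum(1).argmax(1)[in_any]  # interval index for each covered time

def logic(s):
    if s.isin([1, 2]).any(): return s.min()
    if s.isin([10]).any(): return s.max()
    return s.mode()[0]

df_1['other_data'] = df_2.other_data[in_any].groupby(group).agg(logic)
print(df_1)
```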

Upvotes: 1

Views: 50

Answers (2)

rafaelc

Reputation: 59264

I would definitely avoid the loops and go with a vectorized solution.

m = np.greater_equal.outer(df2.time.values, df.start_time.values) & \
    np.less_equal.outer(df2.time.values, df.stop_time.values)

def logic(s):
    if s.isin([1, 2]).any(): return s.min()
    if s.isin([10]).any(): return s.max()
    return s.mode()[0]  # agg needs a scalar, so take the first mode

df['other_data'] = df2.other_data.groupby(m.cumsum(1).argmax(1)).agg(logic)

  unit  start_time  stop_time  other_data
0    A         0.0        3.2           5
1    B         3.3        6.1           2
2    A         6.2       10.5           1
3    B        10.6       13.2           5
4    A        13.3       16.0          10
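
A minimal illustration of what the `m.cumsum(1).argmax(1)` step computes (toy numbers, not the question's data): each row of `m` flags the intervals containing one time, and cumsum-then-argmax yields the index of the matching interval.

```python
import numpy as np

times = np.array([1, 5, 12])
start = np.array([0.0, 4.0, 10.0])
stop  = np.array([2.0, 6.0, 13.0])

m = np.greater_equal.outer(times, start) & np.less_equal.outer(times, stop)
# cumsum along each row makes argmax land on the last True column,
# i.e. the index of the interval that contains the time
labels = m.cumsum(1).argmax(1)
print(labels)  # [0 1 2]
```

Caveat: a time inside no interval has an all-False row, and `argmax` then returns 0 as well, so such rows need filtering (e.g. with `m.any(1)`) before grouping.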

Upvotes: 1

Mulham Jarjanazi

Reputation: 386

You can create a while loop that adds values into each row until df_2.time is out of range.

First, create a list:

other_data = []

Then do something like this:


i = 0
j = 0
while i < len(df_1["unit"]) and j < len(df_2["time"]):
    current_row = []
    # guard j here too, or the inner loop can index past the end of df_2
    while j < len(df_2["time"]) and df_1["start_time"][i] < df_2["time"][j] < df_1["stop_time"][i]:
        current_row.append(df_2["other_data"][j])
        j += 1
    i += 1
    other_data.append(current_row)

You can then easily turn other_data to a numpy array and append it to your dataframe. You may want to change the conditions of the range as you prefer.
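
A runnable version of this loop on small stand-in frames (the frames and column names here are illustrative, and a `j < len(df_2)` guard is added so the inner loop cannot run off the end of df_2):

```python
import pandas as pd

df_1 = pd.DataFrame({'unit': ['A', 'B'],
                     'start_time': [0.0, 9.0],
                     'stop_time':  [8.0, 12.0]})
df_2 = pd.DataFrame({'time': [1, 2, 7, 10, 11],
                     'other_data': [5, 6, 2, 7, 5]})

other_data = []
i = 0
j = 0
while i < len(df_1) and j < len(df_2):
    current_row = []
    # the j < len(df_2) guard keeps the inner loop in bounds
    while j < len(df_2) and df_1["start_time"][i] < df_2["time"][j] < df_1["stop_time"][i]:
        current_row.append(int(df_2["other_data"][j]))  # plain int for clean printing
        j += 1
    i += 1
    other_data.append(current_row)

print(other_data)  # [[5, 6, 2], [7, 5]]
df_1["other_data"] = other_data  # attach the groups as an object column
```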

Upvotes: 0
