jerbear
jerbear

Reputation: 391

Create column based on multiple column conditions from another dataframe

Suppose I have two dataframes - conditions and data.

import pandas as pd

conditions = pd.DataFrame({'class': [1,2,3,4,4,5,5,4,4,5,5,5],
                           'primary_lower': [0,0,0,160,160,160,160,160,160,160,160,800],
                           'primary_upper':[9999,9999,9999,480,480,480,480,480,480,480,480,4000],
                           'secondary_lower':[0,0,0,3500,6100,3500,6100,0,4800,0,4800,10],
                           'secondary_upper':[9999,9999,9999,4700,9999,4700,9999,4699,6000,4699,6000,3000],
                           'group':['A','A','A','B','B','B','B','C','C','C','C','C']})

data = pd.DataFrame({'class':[1,1,4,4,5,5,2],
                     'primary':[2000,9100,1100,170,300,210,1000],
                     'secondary':[1232,3400,2400,380,3600,4800,8600]})

I'd like to generate a new column (group) in the "data" table that assigns a group to each row given the conditions provided in the "conditions" table.

The conditions table is structured so that rows within each group are joined by "OR"s and columns are joined by "AND"s. For example, to be assigned group "B":

(class = 4 AND 160<=primary<=480 AND 3500<=secondary<=4700)

OR

(class = 4 AND 160<=primary<=480 AND 6100<=secondary<=9999)

OR

(class = 5 AND 160<=primary<=480 AND 3500<=secondary<=4700)

OR

(class = 5 AND 160<=primary<=480 AND 6100<=secondary<=9999)

Any rows that don't match any of the conditions will get assigned group "Other". So, the final dataframe would then look like this:

+-------+---------+-----------+-------+
| class | primary | secondary | group |
+-------+---------+-----------+-------+
|     1 |    2000 |      1232 | A     |
|     1 |    9100 |      3400 | A     |
|     4 |    1100 |      2400 | Other |
|     4 |     170 |       380 | C     |
|     5 |     300 |      3600 | B     |
|     5 |     210 |      4800 | C     |
|     2 |    1000 |      8600 | A     |
+-------+---------+-----------+-------+

Upvotes: 1

Views: 688

Answers (1)

jpp
jpp

Reputation: 164623

You can iterate a GroupBy object and take the union of the masks within each group:

for key, grp in conditions.groupby('group'):

    cols = ['class', 'primary_lower', 'primary_upper',
            'secondary_lower', 'secondary_upper']

    masks = (data['class'].eq(cls) & \
             data['primary'].between(prim_lower, prim_upper) & \
             data['secondary'].between(sec_lower, sec_upper) \
             for cls, prim_lower, prim_upper, sec_lower, sec_upper in \
             grp[cols].itertuples(index=False))

    data.loc[pd.concat(masks, axis=1).any(1), 'group'] = key

data['group'] = data['group'].fillna('Other')

Result:

print(data)

   class  primary  secondary  group
0      1     2000       1232      A
1      1     9100       3400      A
2      4     1100       2400  Other
3      4      170        380      C
4      5      300       3600      C
5      5      210       4800      C
6      2     1000       8600      A

Note index=4 has a different result to your desired output since there are multiple conditions which satisfy the data.

Upvotes: 1

Related Questions