Categorizing Pandas column with the indiviual custom bins (tresholds)

Question

I have a dataframe with some numeric values stored in column "value", accompanied by their respective categorical tresholds (warning levels in this case), stored in other columns (in my case "low", "middle", "high"):

     value    low  middle    high
0   179.69  17.42   88.87  239.85
1     2.58  17.81   93.37  236.58
2     1.21   0.05    0.01    0.91
3     1.66   0.20    0.32    4.57
4     3.54   0.04    0.04    0.71
5     5.97   0.16    0.17    2.55
6     5.39   0.86    1.62    9.01
7     1.20   0.03    0.01    0.31
8     3.19   0.08    0.01    0.45
9     0.02   0.03    0.01    0.10
10    3.98   0.18    0.05    0.83
11  134.51  78.63  136.86  478.27
12  254.53  83.73  146.33  486.65
13   15.36  86.07   13.74  185.16
14   85.10  86.12   13.74  185.16
15   15.12   1.37    6.09   30.12

I would like to know in which category each value falls (e.g. first value would be middle, second would be below_low, since it's smaller than any of its tresholds, third would be high, ... you get the idea). So here is the expected output:

     value    low  middle    high   category
0   179.69  17.42   88.87  239.85     middle
1     2.58  17.81   93.37  236.58  below_low
2     1.21   0.05    0.01    0.91       high
3     1.66   0.20    0.32    4.57     middle
4     3.54   0.04    0.04    0.71       high
5     5.97   0.16    0.17    2.55       high
6     5.39   0.86    1.62    9.01     middle
7     1.20   0.03    0.01    0.31       high
8     3.19   0.08    0.01    0.45       high
9     0.02   0.03    0.01    0.10     middle
10    3.98   0.18    0.05    0.83       high
11  134.51  78.63  136.86  478.27        low
12  254.53  83.73  146.33  486.65     middle
13   15.36  86.07   13.74  185.16     middle
14   85.10  86.12   13.74  185.16     middle
15   15.12   1.37    6.09   30.12     middle

So far I use this ugly procedure of "manually" checking line by line, stopping at the first category (from higher to lower), being bigger that the current value:

df["category"]="below_low"    
for i in df.index:
    for cat in ["high","middle","low"]:
        if df.loc[i,"value"]>df.loc[i,cat]:
            df.loc[i,"category"]=cat
            break

I am aware of the pd.cut() method, but I only know how to use it with a predefined generic tresholds list. Can somebody tell what am I missing?

jezrael · Accepted Answer

You can use:

remove column value
compare with lt (less then)
change order of columns
cumulative sum columns - first True get 1
compare with 1 by eq

mask = df.drop('value',axis=1)
         .lt(df['value'], axis=0)
         .reindex(columns=['high','middle','low'])
         .cumsum(axis=1)
         .eq(1)

If all values in columns high, middle and low are False then some correctness is necessary. I create new column with inverting mask and all.

mask['below_low'] = (~mask).all(axis=1)
print (mask)
     high middle    low below_low
0    True  False  False     False
1   False  False  False      True
2    True  False  False     False
3   False   True  False     False
4    True  False  False     False
5    True  False  False     False
6   False   True  False     False
7    True  False  False     False
8    True  False  False     False
9   False   True   True     False
10   True  False  False     False
11  False  False   True     False
12  False   True  False     False
13  False   True   True     False
14  False   True   True     False
15  False   True  False     False

Last call DataFrame.idxmax:

df['category'] = mask.idxmax(axis=1)
print (df)
     value    low  middle    high   category
0   179.69  17.42   88.87  239.85       high
1     2.58  17.81   93.37  236.58  below_low
2     1.21   0.05    0.01    0.91       high
3     1.66   0.20    0.32    4.57     middle
4     3.54   0.04    0.04    0.71       high
5     5.97   0.16    0.17    2.55       high
6     5.39   0.86    1.62    9.01     middle
7     1.20   0.03    0.01    0.31       high
8     3.19   0.08    0.01    0.45       high
9     0.02   0.03    0.01    0.10     middle
10    3.98   0.18    0.05    0.83       high
11  134.51  78.63  136.86  478.27        low
12  254.53  83.73  146.33  486.65     middle
13   15.36  86.07   13.74  185.16     middle
14   85.10  86.12   13.74  185.16     middle
15   15.12   1.37    6.09   30.12     middle

Solution with multiple numpy.where as pointed Paul H:

df['category'] = np.where(df['high'] < df['value'], 'high', 
                 np.where(df['middle'] < df['value'], 'medium',
                 np.where(df['low'] < df['value'], 'low', 'below_low')))

print (df)
     value    low  middle    high   category
0   179.69  17.42   88.87  239.85       high
1     2.58  17.81   93.37  236.58  below_low
2     1.21   0.05    0.01    0.91       high
3     1.66   0.20    0.32    4.57     medium
4     3.54   0.04    0.04    0.71       high
5     5.97   0.16    0.17    2.55       high
6     5.39   0.86    1.62    9.01     medium
7     1.20   0.03    0.01    0.31       high
8     3.19   0.08    0.01    0.45       high
9     0.02   0.03    0.01    0.10     medium
10    3.98   0.18    0.05    0.83       high
11  134.51  78.63  136.86  478.27        low
12  254.53  83.73  146.33  486.65     medium
13   15.36  86.07   13.74  185.16     medium
14   85.10  86.12   13.74  185.16     medium
15   15.12   1.37    6.09   30.12     medium

Categorizing Pandas column with the indiviual custom bins (tresholds)

Answers (2)

Related Questions