How to do greater than/less than binning with pandas DataFrame?

Question

I have a pandas DataFrame as follows:

import pandas as pd
import numpy as np

data = {"first_column": ["item1", "item2", "item3", "item4", "item5", "item6", "item7"],
        "second_column": ["cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"],
        "third_column": [5, 1, 8, 3, 731, 189, 9]}

df = pd.DataFrame(data)

df
     first_column second_column  third_column
0        item1          cat1             5
1        item2          cat1             1
2        item3          cat1             8
3        item4          cat2             3
4        item5          cat2           731
5        item6          cat2           189
6        item7          cat2             9

Now, let's say I wanted to create a fourth column showing the classification of the third column using pandas.cut(). Here, I label each row whether the element in third_column is less than or equal to ten, <=10.

df["less_than_ten"]= pd.cut(df.third_column, [-np.inf, 10, np.inf], labels=(1,0))

And the resulting dataframe is now:

      first_column second_column  third_column less_than_ten
0        item1          cat1             5             1
1        item2          cat1             1             1
2        item3          cat1             8             1
3        item4          cat2             3             1
4        item5          cat2           731             0
5        item6          cat2           189             0
6        item7          cat2             9             1

Question: Notice the second column second_column, with categories cat1 and cat2. How would I use pandas.cut() to reclassify these values based on the "class" in second_column?

More importantly, let's say I wanted more complex intervals, e.g. less or equal to 500 le(500) and greater than or equal to 20 ge(20)? How would this be done? In this case, there should be a 1 labeled by grouping:

      first_column second_column  third_column less_than_ten
0        item1          cat1             5             1
1        item2          cat1             1             1
2        item3          cat1             8             1
3        item4          cat2             3             1
4        item5          cat2           731             0
5        item6          cat2           189             1
6        item7          cat2             9             1

MaxU - stand with Ukraine · Accepted Answer

I wouldn't use pd.cut in this case:

df['less_than_ten'] = df.third_column.le(10).astype(np.uint8)
df.loc[df.second_column=='cat2','less_than_ten'] = \
    df.loc[df.second_column=='cat2','third_column'].le(10).astype(np.uint8) + 2

Result:

In [99]: df
Out[99]:
  first_column second_column  third_column  less_than_ten
0        item1          cat1             5              1
1        item2          cat1             1              1
2        item3          cat1             8              1
3        item4          cat2             3              3
4        item5          cat2           731              2
5        item6          cat2           189              2
6        item7          cat2             9              3

How to do greater than/less than binning with pandas DataFrame?

Answers (2)

Related Questions