biwia
biwia

Reputation: 387

pandas: groupby respecting conditions in python

I have the dataframe below with date, id, revenue.

example of dataframe:

import pandas as pd
from io import StringIO

s = """date,id,revenue
2021-05-05,X1,50
2021-05-05,X2,30
2021-05-05,X3,20
2021-05-06,X1,300
2021-05-06,X2,500
2021-05-01,X1,5"""

df = pd.read_csv(StringIO(s))

I want to do some classification and calculate for each date, the number of persons (id) if revenue is respecting some conditions. Count nbr of clients (id) if revenue is between 0 and 10. Count nbr of clients (id) if revenue is between 20 and 50. Count nbr of clients (id) if revenue is between 100 and 1000.

In my case the expected output should be

import pandas as pd
from io import StringIO

s = """date,id,revenue, interval
2021-05-05,X1,50, 3
2021-05-05,X2,30, 3
2021-05-05,X3,20, 3
2021-05-06,X1,300, 2
2021-05-06,X2,500, 2
2021-05-01,X1,5, 1"""

df = pd.read_csv(StringIO(s))

Upvotes: 0

Views: 51

Answers (1)

ifly6
ifly6

Reputation: 5331

You first need to cut your data.

df['bins'] = pd.cut(df['revenue'], bins=
    pd.IntervalIndex.from_tuples([(0, 10), (20, 50), (100, 1000)], closed='both'))

The data frame now looks like this:

         date  id  revenue         bins
0  2021-05-05  X1       50     [20, 50]
1  2021-05-05  X2       30     [20, 50]
2  2021-05-05  X3       20     [20, 50]
3  2021-05-06  X1      300  [100, 1000]
4  2021-05-06  X2      500  [100, 1000]
5  2021-05-01  X1        5      [0, 10]

To cut properly, create an interval index which includes the bounds (ie, passing closed='both'). This all needs to be specified explicitly rather than with just a list of scalars, because of the bound-inclusivity. Note that you'll get unexpected output if your revenue falls outside one of these discontinuous intervals, eg, if revenue is 75.

Then just groupby and count.

bin_count = df.groupby('bins')['id'].count()

bin_count looks like this:

[0, 10]        1
[20, 50]       3
[100, 1000]    2
Name: id, dtype: int64

If you want to join the numbers back to your data frame, which seems to be what you have in your expected output under the name interval (which is a misnomer, use a different name) use pd.Series.map from the bins: df['bin_count'] = df['bins'].map(bin_count).

Upvotes: 1

Related Questions