Reputation: 387
I have the dataframe below with date, id, revenue.
example of dataframe:
import pandas as pd
from io import StringIO
s = """date,id,revenue
2021-05-05,X1,50
2021-05-05,X2,30
2021-05-05,X3,20
2021-05-06,X1,300
2021-05-06,X2,500
2021-05-01,X1,5"""
df = pd.read_csv(StringIO(s))
I want to do some classification and calculate for each date, the number of persons (id) if revenue is respecting some conditions. Count nbr of clients (id) if revenue is between 0 and 10. Count nbr of clients (id) if revenue is between 20 and 50. Count nbr of clients (id) if revenue is between 100 and 1000.
In my case the expected output should be
import pandas as pd
from io import StringIO
s = """date,id,revenue, interval
2021-05-05,X1,50, 3
2021-05-05,X2,30, 3
2021-05-05,X3,20, 3
2021-05-06,X1,300, 2
2021-05-06,X2,500, 2
2021-05-01,X1,5, 1"""
df = pd.read_csv(StringIO(s))
Upvotes: 0
Views: 51
Reputation: 5331
You first need to cut your data.
df['bins'] = pd.cut(df['revenue'], bins=
pd.IntervalIndex.from_tuples([(0, 10), (20, 50), (100, 1000)], closed='both'))
The data frame now looks like this:
date id revenue bins
0 2021-05-05 X1 50 [20, 50]
1 2021-05-05 X2 30 [20, 50]
2 2021-05-05 X3 20 [20, 50]
3 2021-05-06 X1 300 [100, 1000]
4 2021-05-06 X2 500 [100, 1000]
5 2021-05-01 X1 5 [0, 10]
To cut properly, create an interval index which includes the bounds (ie, passing closed='both'
). This all needs to be specified explicitly rather than with just a list of scalars, because of the bound-inclusivity. Note that you'll get unexpected output if your revenue falls outside one of these discontinuous intervals, eg, if revenue is 75.
Then just groupby and count.
bin_count = df.groupby('bins')['id'].count()
bin_count
looks like this:
[0, 10] 1
[20, 50] 3
[100, 1000] 2
Name: id, dtype: int64
If you want to join the numbers back to your data frame, which seems to be what you have in your expected output under the name interval
(which is a misnomer, use a different name) use pd.Series.map
from the bins: df['bin_count'] = df['bins'].map(bin_count)
.
Upvotes: 1