procaffienator
procaffienator

Reputation: 21

How to merge two bins in a pandas data frame?

I am using pd.cut and binning data. After this step, I am finding the mean of data in each bin and if the difference in the mean between two bins are below a threshold, I want to merge the two bins together.


import pandas as pd
df = pd.DataFrame([{ 'col1': 7, 'val': 2},
                   {'col1':   20, 'val': 22},
                   {'col1':  11, 'val': 12},
                   { 'col1': 9, 'val': 13},
                   { 'col1':   14, 'val': 11}])


df['bin1']=pd.cut(df['col1'], 3)

df2 = pd.DataFrame(df.groupby('bin1')['val'].mean())

threshold = 5

Output:


                   val
bin1    
(6.987, 11.333]     9
(11.333, 15.667]    11
(15.667, 20.0]      22

if the difference of mean of val is less than the threshold (5), then i want to merge the bins.

So the new bins now should be:

                 
bin1    
(6.987, 15.667]     
(15.667, 20.0]      

I don't know how to do the last step.. Thank you!

Upvotes: 2

Views: 1076

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150785

The common technique to merge consecutive rows is to work on cumsum of negate condition. Here, that would be:

blocks = df2['val'].diff().gt(threshold).cumsum()

However, since you're working with the index, it's better to just reset it so we can work on columns:

df2 = df2.reset_index()
blocks = df2['val'].diff().gt(threshold).cumsum()
df2.groupby(blocks).agg({
    'bin1': lambda x: pd.Interval(x.iloc[0].left, x.iloc[-1].right)
})

Output:

                bin1
val                 
0    (6.987, 15.667]
1     (15.667, 20.0]

Upvotes: 1

Related Questions