Reputation: 21
I am using pd.cut and binning data. After this step, I am finding the mean of data in each bin and if the difference in the mean between two bins are below a threshold, I want to merge the two bins together.
import pandas as pd
df = pd.DataFrame([{ 'col1': 7, 'val': 2},
{'col1': 20, 'val': 22},
{'col1': 11, 'val': 12},
{ 'col1': 9, 'val': 13},
{ 'col1': 14, 'val': 11}])
df['bin1']=pd.cut(df['col1'], 3)
df2 = pd.DataFrame(df.groupby('bin1')['val'].mean())
threshold = 5
Output:
val
bin1
(6.987, 11.333] 9
(11.333, 15.667] 11
(15.667, 20.0] 22
if the difference of mean of val is less than the threshold (5), then i want to merge the bins.
So the new bins now should be:
bin1
(6.987, 15.667]
(15.667, 20.0]
I don't know how to do the last step.. Thank you!
Upvotes: 2
Views: 1076
Reputation: 150785
The common technique to merge consecutive rows is to work on cumsum of negate condition. Here, that would be:
blocks = df2['val'].diff().gt(threshold).cumsum()
However, since you're working with the index, it's better to just reset it so we can work on columns:
df2 = df2.reset_index()
blocks = df2['val'].diff().gt(threshold).cumsum()
df2.groupby(blocks).agg({
'bin1': lambda x: pd.Interval(x.iloc[0].left, x.iloc[-1].right)
})
Output:
bin1
val
0 (6.987, 15.667]
1 (15.667, 20.0]
Upvotes: 1