Viktor.w
Viktor.w

Reputation: 2297

Pandas: divide column into three bins of exact same size

What I have right now looks like this: spread

0  0.00000787  
1  0.00000785  
2  0.00000749  
3  0.00000788  
4  0.00000786  
5  0.00000538  
6  0.00000472  
7  0.00000759  

And I would like to add a new column next to it, and if the value of spread in between (for example) 0 and 0.00005 then it is part of bin A, if (for example) between 0.00005 and 0.0006 then bin B (there are three bins in total). What I have tried so far:

minspread = df['spread'].min()
maxspread = df['spread'].max()
born = (float(maxspread)-float(minspread))/3
born1 = born + float(minspread)
born2 = float(maxspread) - born


df['Bin'] = df['spread'].apply(lambda x: 'A' if x < born1 else ( 'B' if born1 < x <= born2 else 'C'))

But when I do so everything ends up in the Bin A:

    spread    Bin                              
0  0.00000787   A   
1  0.00000785   A   
2  0.00000749   A                                  
3  0.00000788   A                                  
4  0.00000786   A 

Does anyone have an idea on how to divide the column 'spread' in three bins (A-B-C) with the same number of observations in it? Thanks!

Upvotes: 1

Views: 2084

Answers (1)

jezrael
jezrael

Reputation: 862581

If get error:

unsupported operand type(s) for +: 'decimal.Decimal' and 'float'

It means the column type is Decimal, which works poorly with pandas, and should be converted to numeric.

One possible solution is to multiply columns by some big number e.g. 10e15 and convert to integer to avoid lost precision if converting to floats and then use qcut:

#sample data
#from decimal import Decimal
#df['spread'] = [Decimal(x) for x in df['spread']]

df['spread1'] = (df['spread'] * 10**15).astype(np.int64)
df['bins'] = pd.qcut(df['spread1'], 3, labels=list('ABC'))
print (df)

       spread     spread1 bins
0  0.00000787  7870000000    C
1  0.00000785  7850000000    B
2  0.00000749  7490000000    A
3  0.00000788  7880000000    C
4  0.00000786  7860000000    C
5  0.00000538  5380000000    A
6  0.00000472  4720000000    A
7  0.00000759  7590000000    B

Solution with no new column:

s = (df['spread'] * 10**15).astype(np.int64)
df['bins'] = pd.qcut(s, 3, labels=list('ABC'))
print (df)

       spread bins
0  0.00000787    C
1  0.00000785    B
2  0.00000749    A
3  0.00000788    C
4  0.00000786    C
5  0.00000538    A
6  0.00000472    A
7  0.00000759    B

Upvotes: 3

Related Questions