Reputation: 477
I have a trivial question. I have a very large DataFrame with many columns. I am trying to find the most efficient way to bin all of the columns with different bin sizes and create a new DataFrame. Here is an example that bins only a single column:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)), columns=list('ABCD'))
newDF = pd.cut(df.A, 2, precision=0)
newDF
0 (9.0, 18.0]
1 (-0.0, 9.0]
2 (-0.0, 9.0]
3 (-0.0, 9.0]
4 (9.0, 18.0]
Name: A, dtype: category
Categories (2, interval[float64]): [(-0.0, 9.0] < (9.0, 18.0]]
Upvotes: 2
Views: 176
Reputation: 863321
If you want to process each column separately, use DataFrame.apply:
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)), columns=list('ABCD'))
newDF = df.apply(lambda x: pd.cut(x, 2, precision=0))
print (newDF)
            A            B             C             D
0  (2.0, 4.0]  (8.0, 15.0]   (7.0, 13.0]  (12.0, 18.0]
1  (2.0, 4.0]  (8.0, 15.0]   (7.0, 13.0]  (12.0, 18.0]
2  (4.0, 7.0]  (8.0, 15.0]  (13.0, 19.0]  (12.0, 18.0]
3  (4.0, 7.0]  (8.0, 15.0]   (7.0, 13.0]   (5.0, 12.0]
4  (4.0, 7.0]   (1.0, 8.0]   (7.0, 13.0]   (5.0, 12.0]
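The question asks for different bin sizes per column, while the apply call above uses 2 bins everywhere. One possible way to vary the bins is to look them up by column name inside the lambda. This is only a sketch: the bins_per_column mapping and the sample values are made up for illustration, and each entry could just as well be a list of explicit bin edges.

```python
import pandas as pd

# Fixed sample data (illustrative values, not the random frame from the post)
df = pd.DataFrame({
    "A": [1, 5, 9, 14, 18],
    "B": [0, 4, 8, 12, 19],
    "C": [2, 3, 10, 15, 17],
    "D": [5, 6, 7, 13, 16],
})

# Hypothetical mapping: column name -> number of equal-width bins
bins_per_column = {"A": 2, "B": 3, "C": 4, "D": 2}

# apply passes each column as a Series whose .name is the column label,
# so the lambda can look up the bin count for that column
newDF = df.apply(lambda col: pd.cut(col, bins_per_column[col.name], precision=0))
print(newDF)
```

Because each column is cut on its own, the interval edges still come from that column's own minimum and maximum.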
If you want to bin all columns with the same bins, use DataFrame.stack to get a Series with a MultiIndex, apply cut, and reshape back with Series.unstack:
newDF = pd.cut(df.stack(), 2, precision=0).unstack()
print (newDF)
              A             B             C             D
0  (10.0, 19.0]  (10.0, 19.0]  (10.0, 19.0]  (-0.0, 10.0]
1  (10.0, 19.0]  (10.0, 19.0]  (-0.0, 10.0]  (-0.0, 10.0]
2  (-0.0, 10.0]  (10.0, 19.0]  (-0.0, 10.0]  (-0.0, 10.0]
3  (-0.0, 10.0]  (-0.0, 10.0]  (10.0, 19.0]  (-0.0, 10.0]
4  (10.0, 19.0]  (10.0, 19.0]  (-0.0, 10.0]  (-0.0, 10.0]
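For comparison, the same shared binning can probably be had without reshaping, by computing the bin edges once from the global minimum and maximum and passing them to cut for every column. This is a rough sketch on made-up sample values. The edges are built with np.linspace, which is an assumption about how you want them spaced, and include_lowest=True is needed so the global minimum falls into the first bin. labels=False returns integer bin codes, which makes the two approaches easy to compare.

```python
import numpy as np
import pandas as pd

# Fixed sample data (illustrative values)
df = pd.DataFrame({
    "A": [1, 5, 9, 14, 18],
    "B": [0, 4, 8, 12, 19],
    "C": [2, 3, 10, 15, 17],
    "D": [5, 6, 7, 13, 16],
})

# Two equal-width bins over the whole frame need three edges
values = df.to_numpy()
edges = np.linspace(values.min(), values.max(), 3)

# Bin every column against the same explicit edges; labels=False gives bin codes
codes_shared = df.apply(
    lambda col: pd.cut(col, bins=edges, labels=False, include_lowest=True)
)

# The stack/unstack approach, also as bin codes
codes_stacked = pd.cut(df.stack(), 2, labels=False).unstack()

print(codes_shared)
print(codes_stacked)
```

On this data both frames hold the same codes. The interval labels would differ slightly, since cut with an integer bin count widens the range a little at the lower end.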
Upvotes: 2