Pandas bin and count

Question

I'm new to Pandas, please don't be too harsh ;) Let's assume my initial data frame looks like this:

#::: initialize dictionary
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0,2,100).astype(bool)
d['flag_B'] = np.random.randint(0,2,100).astype(bool)
d['flag_C'] = np.random.randint(0,2,100).astype(bool)

#::: convert dictionary into pandas dataframe
df = pd.DataFrame(d)

I now bin the data frame according to 'size',

#::: bin pandas dataframe per size
bins = np.arange(0,10,1)
groups = df.groupby( pd.cut( df['size'], bins ) )

which results in this output:

---
(0, 1]
   flag_A flag_B flag_C      size
25  False  False   True  0.091269
40   True   True   True  0.902894
41   True   True   True  0.159964
46  False   True   True  0.494409
53  False   True   True  0.638736
73   True  False   True  0.530348
80   True  False  False  0.669700
88   True   True   True  0.858495
---
(1, 2]
   flag_A flag_B flag_C      size
...

My question is now: How can I proceed from here to get the count of True and False per flag (A,B,C) per bin? E.g. for bin=(0,1] I expect to get something like N_flag_A_true = 5, N_flag_A_false = 3, and so on. Ideally, I would like to get this information summarized by extending this data frame, or into a new data frame.

Khris · Accepted Answer

It can be achieved with multi-index groupbys, concatenating the results and unstacking:

flag_A = df.groupby( [pd.cut( df['size'], bins),'flag_A'] ).count()['size'].to_frame()
flag_B = df.groupby( [pd.cut( df['size'], bins),'flag_B'] ).count()['size'].to_frame()
flag_C = df.groupby( [pd.cut( df['size'], bins),'flag_C'] ).count()['size'].to_frame()

T = pd.concat([flag_A,flag_B],axis=1)
R = pd.concat([T,flag_C],axis=1)
R.columns = ['flag_A','flag_B','flag_C']
R.index.names = [u'Bins',u'Value']
R = R.unstack('Value')

The result is:

       flag_A       flag_B       flag_C      
Value   False True   False True   False True 
Bins                                         
(0, 1]    3.0   5.0    3.0   5.0    1.0   7.0
(1, 2]    6.0   8.0    7.0   7.0    5.0   9.0
(2, 3]    7.0   9.0   11.0   5.0   13.0   3.0
(3, 4]   15.0  12.0   12.0  15.0   17.0  10.0
(4, 5]    2.0   8.0    5.0   5.0    7.0   3.0
(5, 6]    5.0   5.0    3.0   7.0    7.0   3.0
(6, 7]    1.0   5.0    NaN   6.0    3.0   3.0
(7, 8]    NaN   2.0    1.0   1.0    NaN   2.0
(8, 9]    NaN   NaN    NaN   NaN    NaN   NaN

EDIT: You can resolve the multi-index in the columns like this:

R.columns = ['flag_A_F','flag_A_T','flag_B_F','flag_B_T','flag_C_F','flag_C_T']

With the result:

        flag_A_F  flag_A_T  flag_B_F  flag_B_T  flag_C_F  flag_C_T
Bins                                                              
(0, 1]       3.0       5.0       3.0       5.0       1.0       7.0
(1, 2]       6.0       8.0       7.0       7.0       5.0       9.0
(2, 3]       7.0       9.0      11.0       5.0      13.0       3.0
(3, 4]      15.0      12.0      12.0      15.0      17.0      10.0
(4, 5]       2.0       8.0       5.0       5.0       7.0       3.0
(5, 6]       5.0       5.0       3.0       7.0       7.0       3.0
(6, 7]       1.0       5.0       NaN       6.0       3.0       3.0
(7, 8]       NaN       2.0       1.0       1.0       NaN       2.0
(8, 9]       NaN       NaN       NaN       NaN       NaN       NaN

Pandas bin and count

Answers (2)

Related Questions