MaxG
MaxG

Reputation: 243

Pandas bin and count

I'm new to Pandas, please don't be too harsh ;) Let's assume my initial data frame looks like this:

#::: initialize dictionary
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0,2,100).astype(bool)
d['flag_B'] = np.random.randint(0,2,100).astype(bool)
d['flag_C'] = np.random.randint(0,2,100).astype(bool)

#::: convert dictionary into pandas dataframe
df = pd.DataFrame(d)

I now bin the data frame according to 'size',

#::: bin pandas dataframe per size
bins = np.arange(0,10,1)
groups = df.groupby( pd.cut( df['size'], bins ) )

which results in this output:

---
(0, 1]
   flag_A flag_B flag_C      size
25  False  False   True  0.091269
40   True   True   True  0.902894
41   True   True   True  0.159964
46  False   True   True  0.494409
53  False   True   True  0.638736
73   True  False   True  0.530348
80   True  False  False  0.669700
88   True   True   True  0.858495
---
(1, 2]
   flag_A flag_B flag_C      size
...

My question is now: How can I proceed from here to get the count of True and False per flag (A,B,C) per bin? E.g. for bin=(0,1] I expect to get something like N_flag_A_true = 5, N_flag_A_false = 3, and so on. Ideally, I would like to get this information summarized by extending this data frame, or into a new data frame.

Upvotes: 1

Views: 4042

Answers (2)

Jon Clements
Jon Clements

Reputation: 142126

You can apply your group to the DF then pd.melt:

df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])

Which'll give you:

      group variable  value
0    (6, 7]   flag_A  False
1    (3, 4]   flag_A  False
2    (4, 5]   flag_A   True
3    (7, 8]   flag_A   True
4    (6, 7]   flag_A   True
5    (1, 2]   flag_A  False
[...]

Then group by the columns and take the size of each group:

df2 = melted.groupby(['group', 'variable', 'value']).size()

Which gives you:

group   variable  value
(0, 1]  flag_A    False     3
                  True      5
        flag_B    False     3
                  True      5
        flag_C    False     1
                  True      7
(1, 2]  flag_A    False     6
                  True      8
        flag_B    False     7
                  True      7
        flag_C    False     5
                  True      9
(2, 3]  flag_A    False     7
                  True      9
        flag_B    False    11
                  True      5
        flag_C    False    13
                  True      3
        [...]

Then you'll need to re-shape that as to how you want to use it...

Upvotes: 2

Khris
Khris

Reputation: 3212

It can be achieved with multi-index groupbys, concatenating the results and unstacking:

flag_A = df.groupby( [pd.cut( df['size'], bins),'flag_A'] ).count()['size'].to_frame()
flag_B = df.groupby( [pd.cut( df['size'], bins),'flag_B'] ).count()['size'].to_frame()
flag_C = df.groupby( [pd.cut( df['size'], bins),'flag_C'] ).count()['size'].to_frame()

T = pd.concat([flag_A,flag_B],axis=1)
R = pd.concat([T,flag_C],axis=1)
R.columns = ['flag_A','flag_B','flag_C']
R.index.names = [u'Bins',u'Value']
R = R.unstack('Value')

The result is:

       flag_A       flag_B       flag_C      
Value   False True   False True   False True 
Bins                                         
(0, 1]    3.0   5.0    3.0   5.0    1.0   7.0
(1, 2]    6.0   8.0    7.0   7.0    5.0   9.0
(2, 3]    7.0   9.0   11.0   5.0   13.0   3.0
(3, 4]   15.0  12.0   12.0  15.0   17.0  10.0
(4, 5]    2.0   8.0    5.0   5.0    7.0   3.0
(5, 6]    5.0   5.0    3.0   7.0    7.0   3.0
(6, 7]    1.0   5.0    NaN   6.0    3.0   3.0
(7, 8]    NaN   2.0    1.0   1.0    NaN   2.0
(8, 9]    NaN   NaN    NaN   NaN    NaN   NaN

EDIT: You can resolve the multi-index in the columns like this:

R.columns = ['flag_A_F','flag_A_T','flag_B_F','flag_B_T','flag_C_F','flag_C_T']

With the result:

        flag_A_F  flag_A_T  flag_B_F  flag_B_T  flag_C_F  flag_C_T
Bins                                                              
(0, 1]       3.0       5.0       3.0       5.0       1.0       7.0
(1, 2]       6.0       8.0       7.0       7.0       5.0       9.0
(2, 3]       7.0       9.0      11.0       5.0      13.0       3.0
(3, 4]      15.0      12.0      12.0      15.0      17.0      10.0
(4, 5]       2.0       8.0       5.0       5.0       7.0       3.0
(5, 6]       5.0       5.0       3.0       7.0       7.0       3.0
(6, 7]       1.0       5.0       NaN       6.0       3.0       3.0
(7, 8]       NaN       2.0       1.0       1.0       NaN       2.0
(8, 9]       NaN       NaN       NaN       NaN       NaN       NaN

Upvotes: 3

Related Questions