Reputation: 243
I'm new to Pandas, please don't be too harsh ;) Let's assume my initial data frame looks like this:
#::: initialize dictionary
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0,2,100).astype(bool)
d['flag_B'] = np.random.randint(0,2,100).astype(bool)
d['flag_C'] = np.random.randint(0,2,100).astype(bool)
#::: convert dictionary into pandas dataframe
df = pd.DataFrame(d)
I now bin the data frame according to 'size',
#::: bin pandas dataframe per size
bins = np.arange(0,10,1)
groups = df.groupby( pd.cut( df['size'], bins ) )
which results in this output:
---
(0, 1]
flag_A flag_B flag_C size
25 False False True 0.091269
40 True True True 0.902894
41 True True True 0.159964
46 False True True 0.494409
53 False True True 0.638736
73 True False True 0.530348
80 True False False 0.669700
88 True True True 0.858495
---
(1, 2]
flag_A flag_B flag_C size
...
My question is now: How can I proceed from here to get the count of True and False per flag (A,B,C) per bin? E.g. for bin=(0,1] I expect to get something like N_flag_A_true = 5, N_flag_A_false = 3, and so on. Ideally, I would like to get this information summarized by extending this data frame, or into a new data frame.
Upvotes: 1
Views: 4042
Reputation: 142126
You can apply your group to the DF then pd.melt:
df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])
Which'll give you:
group variable value
0 (6, 7] flag_A False
1 (3, 4] flag_A False
2 (4, 5] flag_A True
3 (7, 8] flag_A True
4 (6, 7] flag_A True
5 (1, 2] flag_A False
[...]
Then group by the columns and take the size of each group:
df2 = melted.groupby(['group', 'variable', 'value']).size()
Which gives you:
group variable value
(0, 1] flag_A False 3
True 5
flag_B False 3
True 5
flag_C False 1
True 7
(1, 2] flag_A False 6
True 8
flag_B False 7
True 7
flag_C False 5
True 9
(2, 3] flag_A False 7
True 9
flag_B False 11
True 5
flag_C False 13
True 3
[...]
Then you'll need to re-shape that as to how you want to use it...
Upvotes: 2
Reputation: 3212
It can be achieved with multi-index groupbys, concatenating the results and unstacking:
flag_A = df.groupby( [pd.cut( df['size'], bins),'flag_A'] ).count()['size'].to_frame()
flag_B = df.groupby( [pd.cut( df['size'], bins),'flag_B'] ).count()['size'].to_frame()
flag_C = df.groupby( [pd.cut( df['size'], bins),'flag_C'] ).count()['size'].to_frame()
T = pd.concat([flag_A,flag_B],axis=1)
R = pd.concat([T,flag_C],axis=1)
R.columns = ['flag_A','flag_B','flag_C']
R.index.names = [u'Bins',u'Value']
R = R.unstack('Value')
The result is:
flag_A flag_B flag_C
Value False True False True False True
Bins
(0, 1] 3.0 5.0 3.0 5.0 1.0 7.0
(1, 2] 6.0 8.0 7.0 7.0 5.0 9.0
(2, 3] 7.0 9.0 11.0 5.0 13.0 3.0
(3, 4] 15.0 12.0 12.0 15.0 17.0 10.0
(4, 5] 2.0 8.0 5.0 5.0 7.0 3.0
(5, 6] 5.0 5.0 3.0 7.0 7.0 3.0
(6, 7] 1.0 5.0 NaN 6.0 3.0 3.0
(7, 8] NaN 2.0 1.0 1.0 NaN 2.0
(8, 9] NaN NaN NaN NaN NaN NaN
EDIT: You can resolve the multi-index in the columns like this:
R.columns = ['flag_A_F','flag_A_T','flag_B_F','flag_B_T','flag_C_F','flag_C_T']
With the result:
flag_A_F flag_A_T flag_B_F flag_B_T flag_C_F flag_C_T
Bins
(0, 1] 3.0 5.0 3.0 5.0 1.0 7.0
(1, 2] 6.0 8.0 7.0 7.0 5.0 9.0
(2, 3] 7.0 9.0 11.0 5.0 13.0 3.0
(3, 4] 15.0 12.0 12.0 15.0 17.0 10.0
(4, 5] 2.0 8.0 5.0 5.0 7.0 3.0
(5, 6] 5.0 5.0 3.0 7.0 7.0 3.0
(6, 7] 1.0 5.0 NaN 6.0 3.0 3.0
(7, 8] NaN 2.0 1.0 1.0 NaN 2.0
(8, 9] NaN NaN NaN NaN NaN NaN
Upvotes: 3