Reputation: 21961
I have the following dataframe:
data = {'ps_variable_1': ['Top 33%', 'Bottom 33%', 'Middle 33%', 'Bottom 33%', 'Middle 33%',
'Top 33%', 'Top 33%', 'Bottom 33%', 'Bottom 33%'],
'ps_variable_2': ['Middle 33%', 'Middle 33%', 'Middle 33%', 'Top 33%', 'Bottom 33%',
'Bottom 33%', 'Bottom 33%', 'Middle 33%', 'Top 33%']}
df = pd.DataFrame(data)
ps_variable_1 ps_variable_2
0 Top 33% Middle 33%
1 Bottom 33% Middle 33%
2 Middle 33% Middle 33%
3 Bottom 33% Top 33%
4 Middle 33% Bottom 33%
5 Top 33% Bottom 33%
6 Top 33% Bottom 33%
7 Bottom 33% Middle 33%
8 Bottom 33% Top 33%
I want to get a new dataframe out of it with 3 columns: Top 33% Middle 33% Bottom 33%. Each column has 3 rows as below:
Top 33% Middle 33% Bottom 33%
Top 33%
Middle 33%
Bottom 33%
For each grid cell, I want to count the number of occurrences based on the dataframe above. E.g. the first row contains the number of cases where ps_variable_1 is in the top 33% or top third-percentile, and the respective number of cases where ps_variable_2 is in the top third, middle third and bottom third percentile respectively. How do I do that, I would prefer a solution which can be extended to other ratios as well e.g. top 20%....
Upvotes: 5
Views: 3011
Reputation: 294218
Option 1
pd.crosstab
pd.crosstab(df.ps_variable_1, df.ps_variable_2)
option 2
groupby
and size
df.groupby(['ps_variable_1', 'ps_variable_2']).size().unstack(fill_value=0)
Option 3
np.bincount
u1, f1 = np.unique(df.ps_variable_1.values, return_inverse=True)
u2, f2 = np.unique(df.ps_variable_2.values, return_inverse=True)
n, m = u1.size, u2.size
b = np.bincount(f1 * m + f2, minlength=n * m).reshape(n, m)
pd.DataFrame(b, u1, u2)
Result
ps_variable_2 Bottom 33% Middle 33% Top 33%
ps_variable_1
Bottom 33% 0 2 2
Middle 33% 1 1 0
Top 33% 2 1 0
Upvotes: 14