user308827
user308827

Reputation: 21961

How to get the frequency table for two columns of a dataframe

I have the following dataframe:

data = {'ps_variable_1': ['Top 33%', 'Bottom 33%', 'Middle 33%', 'Bottom 33%', 'Middle 33%',
                          'Top 33%', 'Top 33%', 'Bottom 33%', 'Bottom 33%'],
        'ps_variable_2': ['Middle 33%', 'Middle 33%', 'Middle 33%', 'Top 33%', 'Bottom 33%',
                          'Bottom 33%', 'Bottom 33%', 'Middle 33%', 'Top 33%']}
df = pd.DataFrame(data)

   ps_variable_1 ps_variable_2
0        Top 33%    Middle 33%
1     Bottom 33%    Middle 33%
2     Middle 33%    Middle 33%
3     Bottom 33%       Top 33%
4     Middle 33%    Bottom 33%
5        Top 33%    Bottom 33%
6        Top 33%    Bottom 33%
7     Bottom 33%    Middle 33%
8     Bottom 33%       Top 33%

I want to get a new dataframe out of it with 3 columns: Top 33% Middle 33% Bottom 33%. Each column has 3 rows as below:

               Top 33%     Middle 33%   Bottom 33%
Top 33%         
Middle 33%          
Bottom 33%          

For each grid cell, I want to count the number of occurrences based on the dataframe above. E.g. the first row contains the number of cases where ps_variable_1 is in the top 33% or top third-percentile, and the respective number of cases where ps_variable_2 is in the top third, middle third and bottom third percentile respectively. How do I do that, I would prefer a solution which can be extended to other ratios as well e.g. top 20%....

Upvotes: 5

Views: 3011

Answers (1)

piRSquared
piRSquared

Reputation: 294218

Option 1
pd.crosstab

pd.crosstab(df.ps_variable_1, df.ps_variable_2)

option 2
groupby and size

df.groupby(['ps_variable_1', 'ps_variable_2']).size().unstack(fill_value=0)

Option 3
np.bincount

u1, f1 = np.unique(df.ps_variable_1.values, return_inverse=True)
u2, f2 = np.unique(df.ps_variable_2.values, return_inverse=True)
n, m = u1.size, u2.size
b = np.bincount(f1 * m + f2, minlength=n * m).reshape(n, m)
pd.DataFrame(b, u1, u2)

Result

ps_variable_2  Bottom 33%  Middle 33%  Top 33%
ps_variable_1                                 
Bottom 33%              0           2        2
Middle 33%              1           1        0
Top 33%                 2           1        0

Upvotes: 14

Related Questions