Andrew89
Andrew89

Reputation: 3

Creating Proc Frequency tables in Python with count and percentage

I'm relatively new to Python but made some solid progress however i've come unstuck with converting my process in SAS utilising Proc Freq one way frequency tables into python.

A scaled down version of my dataset is as follows:

import pandas as pd

data = {'Code':  [719, 719,719,719,719,719],
        'National': [1001, 1001,1001,1001,1001,1001],
        'Regional': [3005,3005,3005,3005,3005,3005],
         'Local': [2002,2002,2002,2002,2002,2002],
        'Collection Variable': [1,1,2,3,3,3]}


df = pd.DataFrame (data, columns = ['Code','National', 'Regional', 'Local','Collection Variable'])

Outputs in SAS would take this and add a frequency and percentage column for the collection variable columns, results as below:

National  Regional  Local  Code  Collection  Count  Percent
                                  Variable   

  1001      3005    2002   719       1         2      0.333333
  1001      3005    2002   719       2         1      0.166667
  1001      3005    2002   719       3         3      0.5

Is anything like this possible in python? I've managed to get the percentages, but running in separate columns but not able to get the additional columns.

Furthermore, this would happen for a number of different Collection Variables. In SAS it creates a new table for each variable, is that possible in python with some kind of loop?

Thanks for any help.

Upvotes: 0

Views: 3129

Answers (1)

Sahith Kurapati
Sahith Kurapati

Reputation: 1715

These 2 lines should be enough to get the Count and Percentage columns (it also accounts for nan values):

df = df = df.astype(str).groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'Count'})
df['Percentage'] = df[df['Collection Variable'] != 'nan']['Count'].apply(lambda x: x/sum(df[df['Collection Variable'] != 'nan']['Count']))

Output:

National  Regional  Local  Code  Collection  Count  Percent
                                  Variable   

  1001      3005    2002   719       1         2      0.333333
  1001      3005    2002   719       2         1      0.166667
  1001      3005    2002   719       3         3      0.5

Hope that helps :)

Upvotes: 0

Related Questions