Dine

Reputation: 47

Loop over each column of a dataframe separately

I am currently working with a dataframe imported from Excel. The head of the dataframe looks like this:

   CRED  ACBA  PAYMS  PUR
0     0     2      2    2
1     0     4      2    2
2     0     1      2    3
3     1     1      2    2
4     0     2      4    3

I subdivided this dataframe into a smaller dataframe for the rows where CRED = 1:

df_CRED1 = df_original[df_original.CRED == 1]

And I applied the following operations to column 'ACBA':

from pandas import DataFrame

# value counts for ACBA within the CRED == 1 subset
list_frequency_cred1 = [df_CRED1['ACBA'].value_counts()]
frequency_cred1_total = sum(df_CRED1['ACBA'].value_counts())
matrix_frequency_cred1 = DataFrame(data=list_frequency_cred1)
matrix_frequency_cred1['Total'] = frequency_cred1_total
matrix_frequency_cred1.rename(index={'ACBA': 'CRED1'}, inplace=True)

In order to obtain the following table:

        1   2  3   4  Total
CRED1   9  11  1   7     28

I am now looking to create a loop that applies the same sequence of operations I used for ACBA to every other column, so that I obtain a separate frequency table for each column:

ACBA
        1   2  3   4  Total
CRED1   9  11  1   7     28

PAYMS
        1   2  3   4  Total
CRED1   4   5  6   7     22

etc... 

I don't understand how to set up the loop so that each column is considered separately. Later in my code I will have to apply other operations to the same dataframe, so I would like to understand the underlying logic (rather than just find functions related to frequencies). Thank you

Upvotes: 2

Views: 1154

Answers (2)

dgoodman1

Reputation: 316

It looks like this is what you're trying to do; hopefully it isn't overkill:

Create test data:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 5, (150, 4)), columns=['CRED', 'ACBA', 'PAYMS', 'PUR'])
df.loc[:, ['ACBA', 'PAYMS', 'PUR']] = df[['ACBA', 'PAYMS', 'PUR']].replace(0, np.nan)
df.head()

   CRED  ACBA  PAYMS  PUR
0     4     1      2  NaN
1     4     3      2  NaN
2     1   NaN      1    3
3     0   NaN    NaN    3
4     4     1      4    2

Compute value counts:

def get_value_counts(grp):
    """Compute value counts for each column in DataFrame subset."""
    return grp.drop('CRED', axis=1).apply(pd.value_counts)

vc = df.groupby('CRED').apply(get_value_counts)
vc.head()

          ACBA  PAYMS  PUR
CRED                      
0    1.0     2      1    7
     2.0     9      7    1
     3.0     5      5   13
     4.0     3      4    3
1    1.0     7      7    6

Reformat the values as categoricals so that a 'Total' category can be added:

vc = (vc
      .reset_index(level=1)
      .rename(columns={'level_1': 'VALUE'})
      .assign(VALUE=lambda frame: (frame.VALUE
                                   .astype('int')
                                   .astype('category')
                                   .cat.add_categories(['Total'])))
      .set_index('VALUE', append=True))
vc.columns.names = ['VARIABLE']
vc.head()

VARIABLE    ACBA  PAYMS  PUR
CRED VALUE                  
0    1         2      1    7
     2         9      7    1
     3         5      5   13
     4         3      4    3
1    1         7      7    6

Compute the totals and add the 'Total' label to the columns:

vc_totals = vc.groupby(level=0).sum().astype('int')
idx = pd.MultiIndex.from_product([vc_totals.columns, ['Total']], names=['VARIABLE', 'VALUE'])
vc_totals.columns = idx
vc_totals.head()

VARIABLE  ACBA PAYMS   PUR
VALUE    Total Total Total
CRED                      
0           19    17    24
1           28    28    28
2           27    22    26
3           16    19    19
4           33    31    26

Combine value counts and their totals:

vc_results = vc.unstack(fill_value=0).join(vc_totals).sort_index(axis=1, level=0)
vc_results

VARIABLE ACBA                  PAYMS                 PUR                  
VALUE       1  2   3   4 Total     1  2  3   4 Total   1   2   3   4 Total
CRED                                                                      
0           2  9   5   3    19     1  7  5   4    17   7   1  13   3    24
1           7  7   5   9    28     7  9  6   6    28   6   9   8   5    28
2           7  2   8  10    27     5  7  4   6    22   5   6   5  10    26
3           5  6   3   2    16     5  4  6   4    19   4   5   4   6    19
4          13  6  11   3    33     7  9  4  11    31   2  11   5   8    26

If you just want CRED = 1:

vc_results.loc[1].unstack()

VALUE     1  2  3  4  Total
VARIABLE                   
ACBA      7  7  5  9     28
PAYMS     7  9  6  6     28
PUR       6  9  8  5     28
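
If you would rather print one table per variable for CRED = 1 (like the separate ACBA / PAYMS tables in the question), a rough sketch along these lines should work with the vc_results frame built above:

# Sketch: one printed one-row frequency table per variable for CRED == 1,
# using the vc_results frame built above.
for var in vc_results.columns.get_level_values('VARIABLE').unique():
    table = vc_results.loc[1, var].to_frame().T   # row for CRED == 1
    table.index = ['CRED1']                       # label the row as in the question
    print(var)
    print(table, '\n')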

Upvotes: 1

BartDur

Reputation: 1106

I am sure there is a more efficient way of doing this (e.g. by storing the frequencies of all the columns in a single DataFrame, so that you can avoid the loop entirely). However, if you really want to make separate dfs for each column, you could do something like this:

cols = list(df.columns)[1:]  # exclude CRED from the list of columns to process
df_dict = {}
for col in cols:
    col_df = <your operations to generate a df>
    df_dict[col] = col_df

You can then retrieve the df you're interested in with ACBA_df = df_dict['ACBA'], for example.
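
For completeness, here is a rough sketch of that loop with the operations from the question dropped in (assuming df_original and the CRED == 1 subset from the question; exact display may differ slightly between pandas versions):

import pandas as pd

# Sketch: build one frequency table per column for the CRED == 1 subset.
df_CRED1 = df_original[df_original.CRED == 1]
cols = list(df_original.columns)[1:]                      # exclude CRED

freq_dict = {}
for col in cols:
    counts = df_CRED1[col].value_counts().sort_index()    # frequency of each value
    freq = pd.DataFrame([counts])                         # one-row frequency table
    freq['Total'] = counts.sum()                          # total observations
    freq.index = ['CRED1']                                 # label the single row
    freq_dict[col] = freq

# e.g. freq_dict['ACBA'] is the ACBA frequency table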

Upvotes: 0
