Reputation: 47
I am currently working with a dataframe imported from Excel. The head of the dataframe looks like this:
   CRED  ACBA  PAYMS  PUR
0     0     2      2    2
1     0     4      2    2
2     0     1      2    3
3     1     1      2    2
4     0     2      4    3
I subdivided this dataframe into a smaller dataframe for when CRED = 1
df_CRED1 = df_original[df_original.CRED == 1]
And I applied the following operations to column 'ACBA'
from pandas import DataFrame

list_frequency_cred1 = [df_CRED1['ACBA'].value_counts()]
frequency_cred1_total = sum(df_CRED1['ACBA'].value_counts())
matrix_frequency_cred1 = DataFrame(data=list_frequency_cred1)
matrix_frequency_cred1['Total'] = frequency_cred1_total
matrix_frequency_cred1.rename(index={'ACBA': 'CRED1'}, inplace=True)
In order to obtain the following table:
       1   2  3  4  Total
CRED1  9  11  1  7     28
I am now looking to create a loop that would apply this sequence of operations I did on column ACBA to all other columns in order to obtain separate frequency tables for each column:
ACBA
       1   2  3  4  Total
CRED1  9  11  1  7     28

PAYMS
       1  2  3  4  Total
CRED1  4  5  6  7     22

etc...
I don't understand how to set up the loop so that it considers each column separately. Later in my code I will have to apply other operations to the same dataframe, so I would like to understand the underlying logic (rather than just find functions related to frequencies). Thank you
Upvotes: 2
Views: 1154
Reputation: 316
It looks like this is what you're trying to do, hopefully this isn't overkill:
Create test data:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 5, (150, 4)), columns=['CRED', 'ACBA', 'PAYMS', 'PUR'])
df.loc[:, ['ACBA', 'PAYMS', 'PUR']] = df[['ACBA', 'PAYMS', 'PUR']].replace(0, np.nan)
df.head()
   CRED  ACBA  PAYMS  PUR
0     4     1      2  NaN
1     4     3      2  NaN
2     1   NaN      1    3
3     0   NaN    NaN    3
4     4     1      4    2
Compute value counts:
def get_value_counts(grp):
    """Compute value counts for each column in the DataFrame subset."""
    return grp.drop('CRED', axis=1).apply(pd.value_counts)
vc = df.groupby('CRED').apply(get_value_counts)
vc.head()
          ACBA  PAYMS  PUR
CRED
0    1.0     2      1    7
     2.0     9      7    1
     3.0     5      5   13
     4.0     3      4    3
1    1.0     7      7    6
Reformat values to categories for inclusion of Total:
vc = (vc
      .reset_index(level=1)
      .rename(columns={'level_1': 'VALUE'})
      .assign(VALUE=lambda frame: (frame.VALUE
                                   .astype('int')
                                   .astype('category')
                                   .cat.add_categories(['Total'])))
      .set_index('VALUE', append=True))
vc.columns.names = ['VARIABLE']
vc.head()
VARIABLE    ACBA  PAYMS  PUR
CRED VALUE
0    1         2      1    7
     2         9      7    1
     3         5      5   13
     4         3      4    3
1    1         7      7    6
Compute totals and put 'Total' label in columns:
vc_totals = vc.groupby(level=0).sum().astype('int')
idx = pd.MultiIndex.from_product([vc_totals.columns, ['Total']], names=['VARIABLE', 'VALUE'])
vc_totals.columns = idx
vc_totals.head()
VARIABLE   ACBA  PAYMS    PUR
VALUE     Total  Total  Total
CRED
0            19     17     24
1            28     28     28
2            27     22     26
3            16     19     19
4            33     31     26
Combine value counts and their totals:
vc_results = vc.unstack(fill_value=0).join(vc_totals).sort_index(axis=1, level=0)
vc_results
VARIABLE ACBA                 PAYMS                 PUR
VALUE       1  2   3   4 Total    1  2  3   4 Total   1   2   3   4 Total
CRED
0           2  9   5   3    19    1  7  5   4    17   7   1  13   3    24
1           7  7   5   9    28    7  9  6   6    28   6   9   8   5    28
2           7  2   8  10    27    5  7  4   6    22   5   6   5  10    26
3           5  6   3   2    16    5  4  6   4    19   4   5   4   6    19
4          13  6  11   3    33    7  9  4  11    31   2  11   5   8    26
If you just want CRED = 1:
vc_results.loc[1].unstack()
VALUE     1  2  3  4  Total
VARIABLE
ACBA      7  7  5  9     28
PAYMS     7  9  6  6     28
PUR       6  9  8  5     28
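And if you want one separate small table per variable, as in your desired output, you can slice vc_results along the outer column level. A minimal follow-up sketch, assuming the vc_results frame built above (freq_tables is just an illustrative name):

freq_tables = {}
for var in vc_results.columns.get_level_values('VARIABLE').unique():
    # selecting by the outer column level drops it, leaving columns 1..4 and Total
    freq_tables[var] = vc_results[var]

freq_tables['ACBA']  # one row per CRED value, columns 1, 2, 3, 4, Total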
Upvotes: 1
Reputation: 1106
I am sure there is a more efficient way of doing this (e.g. by storing the frequencies of all the columns in a single DataFrame, so that you can avoid looping altogether). However, if you really want a separate DataFrame for each column, you could do something like this:
cols = list(df.columns)[1:]  # exclude CRED from the list of columns to process
df_dict = {}
for col in cols:
    col_df = <your operations to generate a df for this column>
    df_dict[col] = col_df
You can then retrieve the DataFrame you're interested in with ACBA_df = df_dict['ACBA'], for example.
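To make that concrete with the operations from the question, the loop could look roughly like this (a sketch only; make_freq_table is just an illustrative helper name, and df_CRED1 / cols are assumed to exist as defined above):

import pandas as pd

def make_freq_table(series, row_label='CRED1'):
    # the same sequence of operations the question applies to 'ACBA',
    # parameterised by column
    counts = series.value_counts()
    table = pd.DataFrame([counts])
    table['Total'] = counts.sum()
    table.index = [row_label]  # label the single row, e.g. 'CRED1'
    return table

df_dict = {}
for col in cols:  # every column except CRED
    df_dict[col] = make_freq_table(df_CRED1[col])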
Upvotes: 0