Reputation: 73
I have a use case where i have an unknown list of dfs that are generated from a groupby. The groupby is contained in a list. Once the groupby gets done, a unique df is created for each iteration.
I can dynamically create a list and dictionary of the dataframe names, however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. data and Code below:
Code so far:
data_list = [['A','F','B','existing'],['B','F','W','new'],['C','M','H','new'],['D','M','B','existing'],['E','F','A','existing']];
# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns = ['PAT_ID', 'sex', 'race_ethnicity','existing_new_client'])
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];
def sex_race_summary(groupbyColumns):
grouplist = groupbyColumns[grouping].copy();
# create new column to capture the variable names for aggregation and lowercase the fields
df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index();
# create new column to capture the variable names for aggregation and lowercase the fields
df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();
return df;
for grouping in range(len(groupbyColumns)):
exec(f'df_{grouping} = sex_race_summary(groupbyColumns)');
print(df_0)
# create a dictionary
dict_of_dfs = dict();
# a list of the dataframes
df_names = [];
for i in range(len(groupbyColumns)):
df_names.append('df_'+ str(i))
print (df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index();
print(i)
The index of the list matches the suffix of the dataframe.
But i is being passed as a string and thus throwing an error. The reason I need to build it this way and not hard code the dataframes is because I will not know how many df_x will be created
Thanks for your help!
Update based on R Y A N comment:
Thank you so much for this! you actually gave me another idea, so i made some tweeks: this is my final code that results in what I want: a summary and transposed table for each iteration. However, I am learning that globals() are bad practice and I should use a dictionary instead. How could I convert this to a dictionary based process? #create summary tables for each pair of group by columns
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];
for grouping in range(len(groupbyColumns)):
grouplist = groupbyColumns[grouping].copy();
prefix = grouplist[0];
# create new column to capture the variable names for aggregation and lowercase the fields
df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0);
# create new column to capture the variable names for aggregation and lowercase the fields
df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();
#transpose the dataframe from long to wide
df_transposed = df.pivot_table(index= df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index().fillna(0);
# create new df with the index as suffix
globals()['%s_summary'%prefix] = df;
globals()['%s_summary_transposed' %prefix] = df_transposed;
Upvotes: 1
Views: 1107
Reputation: 56
You can use the globals()
method to index the global variable (here a DataFrame) based on the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
# call globals() to return a dictionary of global vars and index on i
i_df = globals()[i]
# changing the index= argument to indirectly reference the first column since it's 'sex' in df_0 but 'race_ethnicity' in df_1
df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum')
print(df_transposed)
However with just this line added, the .pivot_table()
function gives an index error since the 'sex' column exists only in df_0 and not df_1 in df_names. To fix this, you can indirectly reference the first column (i.e. index=i_df.columns[0]
) in the .pivot_table()
method to handle the different column names of the dfs coming through the loop
Upvotes: 1