Ronald Sanchez
Ronald Sanchez

Reputation: 73

call dataframe from list of dataframes python

I have a use case where i have an unknown list of dfs that are generated from a groupby. The groupby is contained in a list. Once the groupby gets done, a unique df is created for each iteration.

I can dynamically create a list and dictionary of the dataframe names, however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. data and Code below:

Code so far:

data_list = [['A','F','B','existing'],['B','F','W','new'],['C','M','H','new'],['D','M','B','existing'],['E','F','A','existing']]; 

 
# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns = ['PAT_ID', 'sex', 'race_ethnicity','existing_new_client'])

groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];

def sex_race_summary(groupbyColumns):
        grouplist = groupbyColumns[grouping].copy();

        # create new column to capture the variable names for aggregation and lowercase the fields
        df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index();

        # create new column to capture the variable names for aggregation and lowercase the fields
        df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();

        return df;

for grouping in range(len(groupbyColumns)):
    exec(f'df_{grouping} = sex_race_summary(groupbyColumns)');

print(df_0)

# create a dictionary
dict_of_dfs = dict();

# a list of the dataframes
df_names = [];

for i in range(len(groupbyColumns)):
    df_names.append('df_'+ str(i))
    print (df_names)

What I'd like to do next is:

# loop through every dataframe and transpose the dataframe
for i in df_names:

   df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index();

   print(i)

The index of the list matches the suffix of the dataframe.

But i is being passed as a string and thus throwing an error. The reason I need to build it this way and not hard code the dataframes is because I will not know how many df_x will be created

Thanks for your help!

Update based on R Y A N comment:

Thank you so much for this! you actually gave me another idea, so i made some tweeks: this is my final code that results in what I want: a summary and transposed table for each iteration. However, I am learning that globals() are bad practice and I should use a dictionary instead. How could I convert this to a dictionary based process? #create summary tables for each pair of group by columns

groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];

for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy();
    prefix = grouplist[0]; 

    # create new column to capture the variable names for aggregation and lowercase the fields
    df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0);

    # create new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();

    #transpose the dataframe from long to wide
    df_transposed = df.pivot_table(index= df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index().fillna(0);

    # create new df with the index as suffix
    globals()['%s_summary'%prefix] = df;
    globals()['%s_summary_transposed' %prefix] = df_transposed;

Upvotes: 1

Views: 1107

Answers (1)

r y an
r y an

Reputation: 56

You can use the globals() method to index the global variable (here a DataFrame) based on the string value i in the loop. See below:

# loop through every dataframe and transpose the dataframe
for i in df_names:
    # call globals() to return a dictionary of global vars and index on i
    i_df = globals()[i]
    # changing the index= argument to indirectly reference the first column since it's 'sex' in df_0 but 'race_ethnicity' in df_1
    df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum')
    print(df_transposed)

However with just this line added, the .pivot_table() function gives an index error since the 'sex' column exists only in df_0 and not df_1 in df_names. To fix this, you can indirectly reference the first column (i.e. index=i_df.columns[0]) in the .pivot_table() method to handle the different column names of the dfs coming through the loop

Upvotes: 1

Related Questions