Harsh780
Harsh780

Reputation: 13

Data Extraction from multiple excel files in pandas dataframe

I'm trying to create a data ingestion routine to load data from multiple excel files with multiple tabs and columns in the pandas data frame. The structuring of the tabs in each of the excel files is the same. Each tab of the excel file should be a separate data frame. As of now, I have created a list of data frames for each excel file that holds all the data from all the tabs as it is concatenated. But, I'm trying to find a way to access each excel from a data structure and each tab of that excel file as a separate data frame. Below mentioned is the current code. Any improvisation would be appreciated!! Please let me know if anything else is needed.

#Assigning the path to the folder variable
folder = 'specified_path'

#Getting the list of files from the assigned path
excel_files = [file for file in os.listdir(folder)]

list_of_dfs = []
for file in excel_files :
    df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)
    df['excelfile_name'] = file.split('.')[0]
    list_of_dfs.append(df)

Upvotes: 0

Views: 872

Answers (2)

Heelara
Heelara

Reputation: 979

To create a separate dataframe for each tab (with duplicated content) in an Excel file, one could iterate over index level 0 values and index with it:

df = pd.concat(pd.read_excel(filename, sheet_name=None))
list_of_dfs = []
for tab in df.index.get_level_values(0).unique():
    tab_df = df.loc[tab]
    list_of_dfs.append(tab_df)

For illustration, here is the dataframe content after reading an Excel file with 3 tabs: full dataframe

After running the above code, here is the content of list_of_dfs:

[        Date  Reviewed  Adjusted
 0 2022-07-11        43        20
 1 2022-07-18        16         8
 2 2022-07-25         8         3
 3 2022-08-01        17         3
 4 2022-08-15        14         6
 5 2022-08-22        12         5
 6 2022-08-29         8         4,
         Date  Reviewed  Adjusted
 0 2022-07-11        43        20
 1 2022-07-18        16         8
 2 2022-07-25         8         3
 3 2022-08-01        17         3
 4 2022-08-15        14         6
 5 2022-08-22        12         5
 6 2022-08-29         8         4,
         Date  Reviewed  Adjusted
 0 2022-07-11        43        20
 1 2022-07-18        16         8
 2 2022-07-25         8         3
 3 2022-08-01        17         3
 4 2022-08-15        14         6
 5 2022-08-22        12         5
 6 2022-08-29         8         4]

Upvotes: 1

Alex
Alex

Reputation: 717

I would propose to change the line

    df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)

to

    df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None))
    df.index = df.index.get_level_values(0)
    df.reset_index().rename({'index':'Tab'}, axis=1)

Upvotes: 1

Related Questions