JonR

Reputation: 111

Merge multiple sheets from multiple Excel workbooks into a single Pandas dataframe

I'm setting up a data pipeline for data from one of our tenants, who delivers data in Excel files: one workbook for each week, and each sheet in the workbook represents a day. We have no control over the format, but the process needs to be flexible enough to handle the variety of names in the workbooks and sheets. This also needs to be in Python, since we aren't allowed to execute macros or VBA (not my policy).

I've tried using pd.read_excel() in a loop, but with sheet_name=None it returns the output as a dictionary of DataFrames, and pd.concat() throws an error. I need to define a process that does the following:

  1. Pulls a list of workbooks from a directory, then
  2. Pulls a list of sheet names from that workbook, then
  3. Loops through each sheet and reads the data into an empty dataframe, then
  4. Repeats for each workbook, appending the results from each into a final, single dataframe.

Eventually, this will be set up as a monthly script pointed at a directory to import the latest data. I'm currently testing it on a few workbooks before I run the entire year's worth of data. Some of the print calls in the code are just temporary output checks. My current code, with the errors being thrown:

import pandas as pd
import os

# Once functioning, re-implement as a def
# Get list of Excel files in the working directory
os.getcwd()
os.chdir('d:\\projects\\chassis\\data')
os.getcwd()
files = os.listdir()
wkbks = [f for f in files if f.endswith('.xlsx')]

Output: ['1-1 through 1-5.xlsx', '1-14 through 1-19.xlsx', '1-21 through 1-26.xlsx']
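A pathlib version of the same listing avoids the chdir calls entirely (a sketch; list_workbooks is just a name I made up for the helper):

```python
from pathlib import Path

def list_workbooks(folder):
    # Hypothetical helper: sorted .xlsx paths in folder, no chdir needed
    return sorted(Path(folder).glob('*.xlsx'))

# e.g. list_workbooks('d:/projects/chassis/data')
```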

If I run it on the full list of workbooks:

# skiprows and usecols are to handle original sheet format, which has extraneous header rows
df = pd.read_excel(wkbks, sheet_name=None, ignore_index=True, skiprows=6, usecols=8)
cdf = pd.concat(df.values())

it throws the error: Invalid file path or buffer object type: <class 'list'>

If I run it with just a single workbook using df = pd.read_excel(wkbks[1], sheet_name=None, ignore_index=True, skiprows=6, usecols=8) then it returns a dictionary of DataFrames (one per sheet) that looks like this:

OrderedDict([('Mon 1.14 to 1.15',     Carrier ID                              Carrier Name  Mission  \
0         XPOR                   XPO PORT SERVICES, INC.  Dropoff   
1         CCOO                      DECO LOGISTICS, INC.  Dropoff   
2         AMPF          AMERICAN PACIFIC FORWARDERS, INC  Dropoff   
3         GPON           GOLD POINT TRANSPORTATION, INC.  Dropoff   
4         FXTR                  FOX TRANSPORTATION, INC.  Dropoff   

Trying to concatenate this dictionary with df2 = pd.concat([pd.concat(v) for k,v in df.items()]) throws the error: TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
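A minimal reproduction with made-up stand-in data (no Excel files involved) shows the same TypeError, since each dict value is already a single DataFrame:

```python
import pandas as pd
from collections import OrderedDict

# Stand-in for what read_excel(..., sheet_name=None) returns: one DataFrame per sheet
df = OrderedDict([
    ('Mon 1.14', pd.DataFrame({'Carrier ID': ['XPOR', 'CCOO']})),
    ('Tue 1.15', pd.DataFrame({'Carrier ID': ['AMPF']})),
])

# The inner pd.concat(v) receives a single DataFrame, not an iterable of them
try:
    pd.concat([pd.concat(v) for k, v in df.items()])
except TypeError as e:
    print(e)
```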

I feel like this is close to working, but I'm missing some step to:

  1. Handle the list of files all at once, without having to process one at a time
  2. Transform the dictionary of dataframes correctly.

Upvotes: 0

Views: 4488

Answers (1)

jezrael

Reputation: 862406

You can use glob to get all the Excel files in the folder, join each workbook's sheets together with concat, and then combine the per-workbook results into one final DataFrame:

import glob
import pandas as pd

path = 'd:/projects/chassis/data/*.xlsx'
dfs = []
for f in glob.glob(path):
    # sheet_name=None returns a dict of DataFrames, one per sheet;
    # ignore_index is not a read_excel argument, and usecols now
    # expects a list/range rather than a plain int
    df = pd.read_excel(f, sheet_name=None, skiprows=6, usecols=range(9))
    dfs.append(pd.concat(df.values()))

# DataFrame.append was removed in pandas 2.0, so concat the list once at the end
all_data = pd.concat(dfs, ignore_index=True)
print(all_data)
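If the source sheet for each row should be kept, concat can also take the dict directly, so its keys become an index level. A small sketch with invented sheet names standing in for one workbook:

```python
import pandas as pd

# Invented stand-in for one workbook's dict of sheets
sheets = {
    'Mon 1.14': pd.DataFrame({'Carrier ID': ['XPOR'], 'Mission': ['Dropoff']}),
    'Tue 1.15': pd.DataFrame({'Carrier ID': ['CCOO'], 'Mission': ['Dropoff']}),
}

# Dict keys label the outer index level; move that level into a 'sheet' column
cdf = (pd.concat(sheets, names=['sheet', 'row'])
         .reset_index(level='sheet')
         .reset_index(drop=True))
print(cdf)
```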

Upvotes: 3
