Reputation: 111
I'm setting up a data pipeline for data from one of our tenants, who delivers data in Excel files: one workbook for each week, and each sheet in the workbook represents a day. We have no control over the format, but the process needs to be flexible enough to handle the variety of names in the workbooks and sheets. This also needs to be in Python, since we aren't allowed to execute macros or VBA (not my policy).
I've tried using pd.read_Excel()
in a loop, but it will currently return the output as a dictionary of dataframes, and the pd.concat()
function throws an error. I need to define a process that does the following:
Eventually, this will be set up into a monthly script that will be pointed at a directory to import the latest data. I'm currently testing it on a few workbooks before I run the entire year's worth of data. Some of the print functions in the code are just for temp output checks. My current code, with the errors being thrown:
import pandas as pd
import os
# Once functioning, re-implement as a def
# Get list of Excel files in the working directory
os.getcwd()
os.chdir('d:\\projects\\chassis\\data')
os.getcwd()
files = os.listdir()
wkbks = [f for f in files if f[-4:] =='xlsx']
Output: ['1-1 through 1-5.xlsx', '1-14 through 1-19.xlsx', '1-21 through 1-26.xlsx']
If I run it on the full list of workbooks:
# skiprows and usecols are to handle original sheet format, which has extraneous header rows
df = pd.read_excel(wkbks, sheet_name=None, ignore_index=True, skiprows=6, usecols=8)
cdf = pd.concat(df.values())
it throws the error: Invalid file path or buffer object type: <class 'list'>
If I run it with just a single workbook using df = pd.read_excel(wkbks[1], sheet_name=None, ignore_index=True, skiprows=6, usecols=8)
then it returns a dictionary of lists that looks like this:
OrderedDict([('Mon 1.14 to 1.15', Carrier ID Carrier Name Mission \
0 XPOR XPO PORT SERVICES, INC. Dropoff
1 CCOO DECO LOGISTICS, INC. Dropoff
2 AMPF AMERICAN PACIFIC FORWARDERS, INC Dropoff
3 GPON GOLD POINT TRANSPORTATION, INC. Dropoff
4 FXTR FOX TRANSPORTATION, INC. Dropoff
Trying to concatenate this dictionary with df2 = pd.concat([pd.concat(v) for k,v in df.items()])
throws the error: TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
I feel like this is close to working, but I'm missing some step to:
Upvotes: 0
Views: 4488
Reputation: 862406
You can use glob
for all files from folder, then join together by concat
and last DataFrame.append
to final DataFrame
:
import glob
all_data = pd.DataFrame()
path = 'd:/projects/chassis/data/*.xlsx'
for f in glob.glob(path):
df = pd.read_excel(f, sheet_name=None, ignore_index=True, skiprows=6, usecols=8)
cdf = pd.concat(df.values())
all_data = all_data.append(cdf,ignore_index=True)
print(all_data)
Upvotes: 3