Reputation: 51
Here is my problem. We have an Excel based report that business users enter comments into two separate fields, as well as selecting a code form a drop down. We then have a manual process that collects those files and pushes the comments and codes to a Snowflake table to be able to use in various reports.
I am trying to improve the process with a Python script that will collect the files, copy them to a staging_folder location, then read in the data from the sheet, append it all together, do some cleanup and push to Snowflake. The plan is that this would be completely automated - but this is where we run into issues.
Initial step works perfectly. I have a loop that grabs the files based on the previous business day date, copies them to a staging folder. There are typically 32 files each day.
Next step reads those files to append to a dataframe. Here is the function that is loading the Excel files in my Python script.
def load_files():
file_list = glob.glob(file_path + r'\*')
df = pd.DataFrame()
print("Importing data to Pandas DF...")
for file in file_list:
try:
wb = load_workbook(file)
ws = wb["Daily Outs"]
data = ws.values
cols = next(data)[1:]
data = list(data)
idx = [r[0] for r in data]
data = (islice(r, 1, None) for r in data)
data_1 = pd.DataFrame(data, index=idx, columns=cols)
df = df.append(data_1, sort=False)
print(file + " Imported to Df...")
except Exception as e:
print("Error: " + e + " When attempting to open file: " + file)
# error_notify(e)
print(df.head(10))
return df
The problem is when we have files that have some sort of corruption. The files when opened manually will show an error like the one below.
I thought with my try, except code above this would catch an error like this and alert me with the error_notify(e) function. However, we get a result where the Python script crashes with an error like this: zipfile.BadZipFile: File is not a zip file During handling of the above exception, another exception occurred.
There is more to the error, but I only copied & pasted this part in some communication with some folks int he office. Impossible to replicate the error on our own - I have no idea how the files get corrupted in this way - except that there are multiple people accessing the files throughout the day.
The way to make the file readable is completely manual - we must open the file, get that error, hit yes, and save the file over the existing one. Then re-launch the script. But since the try, except isn't catching it and alerting us to the failure, we have to run the script manually to see if it works or not.
Two questions - am I doing something incorrect in my try, except command? I am admittedly weak in error catching so my first thought is there is more I can do there to make that work. Secondly, is there a Python way to get past that error in the Excel workbook files?
Here is the error text: Traceback (most recent call last): File "G:/Replenishment/Reporting/00 - I&A Replenishment/02 - Service Level/Daily Outs Comment Capture/Python/daily_outs_missed_files.py", line 48, in load_files wb = load_workbook(file) File "C:\ProgramData\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 314, in load_workbook data_only, keep_links) File "C:\ProgramData\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 124, in init self.archive = _validate_archive(fn) File "C:\ProgramData\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 96, in _validate_archive archive = ZipFile(filename, 'r') File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1222, in init self._RealGetContents() File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1289, in _RealGetContents raise BadZipFile("File is not a zip file") zipfile.BadZipFile: File is not a zip file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "G:/Replenishment/Reporting/00 - I&A Replenishment/02 - Service Level/Daily Outs Comment Capture/Python/daily_outs_missed_files.py", line 123, in <module>
main()
File "G:/Replenishment/Reporting/00 - I&A Replenishment/02 - Service Level/Daily Outs Comment Capture/Python/daily_outs_missed_files.py", line 86, in main
df_output = df_clean()
File "G:/Replenishment/Reporting/00 - I&A Replenishment/02 - Service Level/Daily Outs Comment Capture/Python/daily_outs_missed_files.py", line 68, in df_clean
df = load_files()
File "G:/Replenishment/Reporting/00 - I&A Replenishment/02 - Service Level/Daily Outs Comment Capture/Python/daily_outs_missed_files.py", line 61, in load_files
print("Error: " + e + " When attempting to open file: " + file)
TypeError: can only concatenate str (not "BadZipFile") to str
Upvotes: 0
Views: 5180
Reputation: 429
Your try/except code looks correct. All user defined exceptions in python should be classes based on Exception. See BaseException and and Exception in python documentation : "Exception (..) All user-defined exceptions should also be derived from this class" see also the exception class hierarchy tree at the end of the python doc sesction.
If your python script "crashes" it means one of the library procedures throws an exception which is not based on the Exception class, something that "should not" be. You could look at the Traceback and try catching the offending exception type separately, or find what part of the source code and which library is the cause, fix it and submit a PR. Here are two examples of a good and bad way of deriving own exceptions
class MyBadError(BaseException):
"""
my bad exception, do not make yours that way
"""
pass
instead of recommended
class MyGoodError(Exception):
"""
exception based on the Exception
"""
pass
Where and what exactly fails is a bit of mystery still but the problems with your exception from the Traceback is not new, see zipfile.BadZipfile issue in pandas discussion. Note that xlrd used by pandas to read Excel workbooks data is currently a "no-maintainer-ware" declaration about xlrd from the authors and in case of any issues the recommendation is to use openpyxl instead or fix any issues yourself (pandas maintainers are doing pontius pilate on that, but happily use xlrd as a dependency). I suggest you catch the BadZipfile as a special known corruption error separately from all other exceptions, see python error handling tutorial for example code (you probably already have seen it, this is for other readers). If that does not work I can trace it in the source code of your libraries / python modules to the exact offending section and find the culprit, if you reach out directly.
Upvotes: 1