Reputation: 1155
There are some .xlsx
files in an specific directory that are corrupted, because the windows message when trying to open the workbook is the following:
Excel cannot open the file 'filename.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.`
I am wondering if it is possible to detect these corrupted files and remove it from directory.
My trial:
############### path settlement and file names ##########
path_reportes = os.path.join(os.getcwd(), 'Reports', 'xlsx_folder')
file_names = os.listdir(path_reportes)
overall_df = dict()
############## concatenate all reports ##################
for file_name in file_names:
data_file_path = os.path.join(path_reportes, file_name)
"""
try open spreadsheets, save them and store them in a dictionary key
except when the file is corrupted, if so, remove it from the
folder
"""
try:
# Start by opening the spreadsheet and selecting the main sheet
workbook = openpyxl.load_workbook(filename=data_file_path)
sheet = workbook.active
# Save the spreadsheet
workbook.save(filename=data_file_path)
df_report_dict = pd.read_excel(data_file_path, sheet_name=None, engine='openpyxl')
for key in df_report_dict:
df_report_dict[key]['report_name'] = file_name
try:
overall_df[key] = overall_df[key].append(df_report_dict[key], ignore_index=True)
except:
overall_df[key] = df_report_dict[key]
# when file corrupted then remove it from the folder
except BadZipFile:
os.remove(data_file_path)
Which throws the next error:
NameError: name 'BadZipFile' is not defined
Is it possible to detect corrupted files? How could I handle them?
Upvotes: 0
Views: 860
Reputation: 44148
If you are still having an issue with BadZipFile
not defined, then:
Since the exception class BadZipFile
is in module zipfile
, you just need an import
statement such as:
from zipfile import BadZipFile
You should then be able to handle the exception.
Upvotes: 1
Reputation: 562
Scenario: I've created three identical excel files in a directory named xlsx_folder
, and want to merge all the files into a data frame
. Instead of using os
module, I recommend the use of glob
for this purpose.
import os # for deleting corrupted file
import glob # to list out a specific file type
import pandas as pd
# here is a list of all the file in the directory
print(glob.glob("xlsx_folder/*.xlsx"))
Output:
['xlsx_folder\\file1 - Copy (2).xlsx',
'xlsx_folder\\file1 - Copy.xlsx',
'xlsx_folder\\file1.xlsx',
'xlsx_folder\\~$file1.xlsx']
Note: In windows when an excel file is open - it creates a temporary file with ~$
symbol which is a temporary file (and, in this case, I am considering it as a corrupt file).
Now, you can read all the files from the directory, and make one single data frame as follows:
overall_df = []
for f in glob.glob("xlsx_folder/*.xlsx"):
try:
overall_df.append(pd.read_excel(f)) # if there is an encoding error, handle it here
except Exception as err:
print(f"Unable to Read: {f}.\n{err}") # use format if not familiar with f-strings
# delete the file with os.remove
# os.remove(f)
overall_df = pd.concat(overall_df, ignore_index = True)
This prints a warning statement like:
Unable to Read: xlsx_folder\~$file1.xlsx.
[Errno 13] Permission denied: 'xlsx_folder\\~$file1.xlsx'
Upvotes: 1
Reputation: 77857
What exception do you get when you try to load a corrupted Excel file? Run that experiment, and then write a try-except
block to handle the condition.
try:
# load PANDAS df
except CorruptedExcelFile:
os.remove(filename)
From the post you reference, it appears that the problem happens while trying to unzip the file, so the appropriate exception is BadZipFile
. Use that in the except
statement. You probably want to restrict the handling to a particular exception, since the result is removing the offending file.
Upvotes: 1