Reputation: 1055

remove corrupted xlsx files from a given directory

Update

There are some .xlsx files in an specific directory that are corrupted, because the windows message when trying to open the workbook is the following:

Excel cannot open the file 'filename.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.`

I am wondering if it is possible to detect these corrupted files and remove it from directory.

My trial:

############### path settlement and file names ##########
path_reportes = os.path.join(os.getcwd(), 'Reports', 'xlsx_folder')
file_names = os.listdir(path_reportes)
overall_df = dict()

############## concatenate all reports ##################

for file_name in file_names:

    data_file_path = os.path.join(path_reportes, file_name)
    """
    try open spreadsheets, save them and store them in a dictionary key
    except when the file is corrupted, if so, remove it from the 
    folder
    """
    try:
     # Start by opening the spreadsheet and selecting the main sheet
        workbook = openpyxl.load_workbook(filename=data_file_path)
        sheet = workbook.active
    
     # Save the spreadsheet
        workbook.save(filename=data_file_path)
        df_report_dict = pd.read_excel(data_file_path, sheet_name=None, engine='openpyxl')
    
        for key in df_report_dict:
            
            df_report_dict[key]['report_name'] = file_name
            
            try:
                  overall_df[key] = overall_df[key].append(df_report_dict[key], ignore_index=True)
            except:
                  overall_df[key] = df_report_dict[key]
                
                
    # when file corrupted then remove it from the folder             
    except BadZipFile:
                   os.remove(data_file_path)

Which throws the next error:

NameError: name 'BadZipFile' is not defined

Is it possible to detect corrupted files? How could I handle them?

Upvotes: 0

Answers (3)

Booboo

Reputation: 44323

If you are still having an issue with BadZipFile not defined, then:

Since the exception class BadZipFile is in module zipfile, you just need an import statement such as:

from zipfile import BadZipFile

You should then be able to handle the exception.

Upvotes: 1

Mr. Hobo

Reputation: 563

Scenario: I've created three identical excel files in a directory named xlsx_folder, and want to merge all the files into a data frame. Instead of using os module, I recommend the use of glob for this purpose.

import os   # for deleting corrupted file
import glob # to list out a specific file type
import pandas as pd

# here is a list of all the file in the directory
print(glob.glob("xlsx_folder/*.xlsx"))

Output:

['xlsx_folder\\file1 - Copy (2).xlsx',
 'xlsx_folder\\file1 - Copy.xlsx',
 'xlsx_folder\\file1.xlsx',
 'xlsx_folder\\~$file1.xlsx']

Note: In windows when an excel file is open - it creates a temporary file with ~$ symbol which is a temporary file (and, in this case, I am considering it as a corrupt file).

Now, you can read all the files from the directory, and make one single data frame as follows:

overall_df = []
for f in glob.glob("xlsx_folder/*.xlsx"):
    try:
        overall_df.append(pd.read_excel(f)) # if there is an encoding error, handle it here
    except Exception as err:
        print(f"Unable to Read: {f}.\n{err}") # use format if not familiar with f-strings
        # delete the file with os.remove
        # os.remove(f)
        
overall_df = pd.concat(overall_df, ignore_index = True)

This prints a warning statement like:

Unable to Read: xlsx_folder\~$file1.xlsx.
[Errno 13] Permission denied: 'xlsx_folder\\~$file1.xlsx'

Upvotes: 1

Prune

Reputation: 77910

What exception do you get when you try to load a corrupted Excel file? Run that experiment, and then write a try-except block to handle the condition.

try:
    # load PANDAS df

except CorruptedExcelFile:
    os.remove(filename)

From the post you reference, it appears that the problem happens while trying to unzip the file, so the appropriate exception is BadZipFile. Use that in the except statement. You probably want to restrict the handling to a particular exception, since the result is removing the offending file.

Upvotes: 1

remove corrupted xlsx files from a given directory

Update

Answers (3)

Related Questions