Reputation: 47
So I have 600,000+ images, and my estimate is that roughly 5-10% of them are corrupted. I'm generating a log of exactly which images are affected.
Using Python, my approach thus far is this:
from PIL import Image

def img_validator(source):
    files = get_paths(source)  # A list of complete paths to each image
    invalid_files = []
    for img in files:
        try:
            im = Image.open(img)
            im.verify()
            im.close()
        except (IOError, OSError, Image.DecompressionBombError):
            invalid_files.append(img)
    # Write invalid_files to file
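    # A minimal sketch of that write step, one path per line
    # (the log filename is my own placeholder):
    with open("corrupt_images.log", "w") as log:
        log.write("\n".join(invalid_files))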
The first 200-250K images are quite fast to process, taking only around 1-2 hours. I left the process running overnight (at that point it was at 230K); 8 hours later it had only reached 310K, though it was still progressing.
Does anyone have an idea why that is? At first I thought it might be because the images are stored on an HDD, but that doesn't really make sense given how fast the first 200-250K were.
Upvotes: 1
Views: 6865
Reputation: 207660
If you have that many images, I would suggest you use multiprocessing. I created 100,000 files of which 5% were corrupt and checked them like this:
#!/usr/bin/env python3

import glob
from multiprocessing import Pool

from PIL import Image

def CheckOne(f):
    try:
        im = Image.open(f)
        im.verify()
        im.close()
        # DEBUG: print(f"OK: {f}")
        return
    except (IOError, OSError, Image.DecompressionBombError):
        # DEBUG: print(f"Fail: {f}")
        return f

if __name__ == '__main__':
    # Create a pool of worker processes to check files in parallel
    p = Pool()

    # Build the list of files to process
    files = glob.glob("*.jpg")
    print(f"Files to be checked: {len(files)}")

    # Map the list of files to check onto the Pool
    result = p.map(CheckOne, files)

    # Filter out None values (files that are OK), leaving just the corrupt ones
    result = list(filter(None, result))
    print(f"Num corrupt files: {len(result)}")
Sample Output
Files to be checked: 100002
Num corrupt files: 5001
That takes 1.6 seconds on my 12-core CPU with an NVMe disk. Your HDD will be slower, but this should still be noticeably faster for you than a single process.
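If holding all 600,000 results in one list is a concern, or you want results as they come in, you could stream them instead of using map(). This is just a sketch, assuming the same CheckOne and files as above; the chunksize of 1000 is a guess you would need to tune:

    # Stream results from the pool rather than collecting them all with map()
    # chunksize=1000 is an assumption, tune it for your workload
    corrupt = []
    with Pool() as p:
        for res in p.imap_unordered(CheckOne, files, chunksize=1000):
            if res is not None:
                corrupt.append(res)
    print(f"Num corrupt files: {len(corrupt)}")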
Upvotes: 4