Reputation: 47
So I have 600,000+ images, and my estimate is that roughly 5-10% of them are corrupted. I'm generating a log of exactly which images are affected.
Using Python, my approach thus far is this:
from PIL import Image

def img_validator(source):
    files = get_paths(source)  # A list of complete paths to each image
    invalid_files = []
    for img in files:
        try:
            im = Image.open(img)
            im.verify()
            im.close()
        except (IOError, OSError, Image.DecompressionBombError):
            invalid_files.append(img)
    # Write invalid_files to file
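    # A minimal sketch of that write step, one path per line
    # (the log filename is my own placeholder):
    with open("corrupt_images.log", "w") as log:
        log.write("\n".join(invalid_files))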
The first 200-250K images are quite fast to process, taking only around 1-2 hours. I left the process running overnight (at that point it was at 230K); 8 hours later it had only reached 310K, though it was still progressing.
Does anyone have an idea why that is? At first I thought it might be because the images are stored on an HDD, but that doesn't really make sense given how fast the first 200-250K were.
Upvotes: 1
Views: 6865
Reputation: 207660
If you have that many images, I would suggest you use multiprocessing. I created 100,000 files of which 5% were corrupt and checked them like this:
#!/usr/bin/env python3

import glob
from multiprocessing import Pool

from PIL import Image

def CheckOne(f):
    try:
        im = Image.open(f)
        im.verify()
        im.close()
        # DEBUG: print(f"OK: {f}")
        return
    except (IOError, OSError, Image.DecompressionBombError):
        # DEBUG: print(f"Fail: {f}")
        return f

if __name__ == '__main__':
    # Create a pool of worker processes to check files in parallel
    p = Pool()

    # Build the list of files to process
    files = glob.glob("*.jpg")
    print(f"Files to be checked: {len(files)}")

    # Map the list of files to check onto the Pool
    result = p.map(CheckOne, files)

    # Filter out None values (files that are OK), leaving just the corrupt ones
    result = list(filter(None, result))
    print(f"Num corrupt files: {len(result)}")
Sample Output
Files to be checked: 100002
Num corrupt files: 5001
That takes 1.6 seconds on my 12-core CPU with an NVMe disk. Your HDD will be slower, but this should still be noticeably faster for you than a single process.
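If holding all 600,000 results in one list is a concern, or you want results as they come in, you could stream them instead of using map(). This is just a sketch, assuming the same CheckOne and files as above; the chunksize of 1000 is a guess you would need to tune:

    # Stream results from the pool rather than collecting them all with map()
    # chunksize=1000 is an assumption, tune it for your workload
    corrupt = []
    with Pool() as p:
        for res in p.imap_unordered(CheckOne, files, chunksize=1000):
            if res is not None:
                corrupt.append(res)
    print(f"Num corrupt files: {len(corrupt)}")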
Upvotes: 4