Reputation: 5302
Hello, I am training a model with TensorFlow and Keras. The dataset was downloaded from https://www.microsoft.com/en-us/download/confirmation.aspx?id=54765
It is a zip archive that I split into the following directories:
.
├── test
│   ├── Cat
│   └── Dog
└── train
    ├── Cat
    └── Dog
The test/Cat and test/Dog folders each contain 1000 jpg photos, and the train/Cat and train/Dog folders each contain 11500 jpg photos.
The data is loaded with this code:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 16
# Data augmentation and preprocessing
train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   validation_split=0.20)  # set validation split
# Train dataset
train_generator = train_datagen.flow_from_directory(
    'PetImages/train',
    target_size=(244, 244),
    batch_size=batch_size,
    class_mode='binary',
    subset='training')  # set as training data
# Validation dataset
validation_generator = train_datagen.flow_from_directory(
    'PetImages/train',
    target_size=(244, 244),
    batch_size=batch_size,
    class_mode='binary',
    subset='validation')  # set as validation data
test_datagen = ImageDataGenerator(rescale=1./255)
# Test dataset (named test_generator so it does not overwrite test_datagen)
test_generator = test_datagen.flow_from_directory(
    'PetImages/test')
The model is trained with the following code:
history = model.fit(train_generator,
                    validation_data=validation_generator,
                    epochs=5)
And I get the following output:
Epoch 1/5
1150/1150 [==============================] - ETA: 0s - loss: 0.0505 - accuracy: 0.9906
But when the epoch reaches this point, I get the following error:
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f9e185347d0>
How can I solve this, in order to finish the training?
Thanks
Upvotes: 8
Views: 48736
Reputation: 21
Instead of appending the corrupted files to a list, we can also just delete each one as soon as the error occurs:
import os
from PIL import Image

folder_path = r"C:\Users\ImageDatasets"
extensions = []
for fldr in os.listdir(folder_path):
    sub_folder_path = os.path.join(folder_path, fldr)
    for filee in os.listdir(sub_folder_path):
        file_path = os.path.join(sub_folder_path, filee)
        print('** Path: {} **'.format(file_path), end="\r", flush=True)
        try:
            # Image.open() is lazy, so convert() is called to force a full decode
            with Image.open(file_path) as im:
                im.convert('RGB')
        except Exception:
            # The file could not be read as an image: report it and delete it
            print(file_path)
            os.remove(file_path)
            continue
        # Record the extension of every readable file
        ext = os.path.splitext(filee)[1].lstrip('.')
        if ext not in extensions:
            extensions.append(ext)
Upvotes: 2
Reputation: 181
I don't know if this is still relevant, but for people who encounter the same problem in the future:
In this specific situation, there are two corrupted files in the dog_cat dataset:
Just remove them and it will work.
Upvotes: 12
Reputation: 81
You may have an image that is corrupt. In the data preprocessing step, try to use Image.open() to see if all the images can be opened.
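A minimal sketch of such a check, assuming Pillow is installed. The function name find_unreadable_images is made up for this illustration, and the sketch only reports paths; it does not delete anything. Since Image.open() alone reads little more than the header, verify() is called as an extra integrity check.

```python
import os
from PIL import Image


def find_unreadable_images(root):
    """Return the paths under `root` that Pillow cannot open and verify.

    Hypothetical helper for illustration; not part of Pillow or Keras.
    """
    bad_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with Image.open(path) as im:
                    # verify() checks file integrity without a full decode
                    im.verify()
            except Exception:
                # Covers UnidentifiedImageError and other read failures
                bad_paths.append(path)
    return bad_paths
```

It could be called as find_unreadable_images('PetImages/train') before building the generators. Note that verify() may not catch every truncated file, so a full decode (for example with convert('RGB')) is a stricter but slower test.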
Upvotes: 1
Reputation: 8112
I have run into this problem before, so I wrote a Python script that checks the training and test directories for valid image files. File extensions must be one of jpg, png, bmp or gif, so the script checks for a proper extension first. It then tries to read the image with cv2. If the file is not a valid image, an exception is raised and caught. In either case the bad file name is printed. At the end, a list called bad_list contains the paths of all bad files. Note that the directories must be named 'test' and 'train'.
import os
import cv2

bad_list = []
root_dir = r'c:\PetImages'
subdir_list = os.listdir(root_dir)  # sub directories of the root, i.e. train and test
for d in subdir_list:  # iterate through the sub directories train and test
    dpath = os.path.join(root_dir, d)  # path to the sub directory
    if d in ['test', 'train']:
        class_list = os.listdir(dpath)  # list of classes, i.e. Dog and Cat
        for klass in class_list:  # iterate through the two classes
            class_path = os.path.join(dpath, klass)  # path to the class directory
            file_list = os.listdir(class_path)  # files in the class directory
            for f in file_list:  # iterate through the files
                fpath = os.path.join(class_path, f)
                index = f.rfind('.')  # index of the last period in the file name
                ext = f[index+1:].lower()  # the file's extension, case-insensitive
                if ext not in ['jpg', 'png', 'bmp', 'gif']:
                    print(f'file {fpath} has an invalid extension {ext}')
                    bad_list.append(fpath)
                else:
                    try:
                        img = cv2.imread(fpath)  # returns None if the file cannot be decoded
                        size = img.shape  # raises AttributeError when img is None
                    except Exception:
                        print(f'file {fpath} is not a valid image file ')
                        bad_list.append(fpath)
print(bad_list)
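Once a list of bad paths such as bad_list exists, one option is to move those files out of the dataset rather than delete them, so they can still be inspected later. The sketch below uses only the standard library; the function name quarantine_files and the quarantine directory are assumptions made for this example.

```python
import os
import shutil


def quarantine_files(paths, quarantine_dir):
    """Move each existing file in `paths` into `quarantine_dir`.

    Hypothetical helper for illustration. Returns the new locations.
    """
    os.makedirs(quarantine_dir, exist_ok=True)
    moved = []
    for i, path in enumerate(paths):
        if not os.path.isfile(path):
            continue  # already moved or deleted
        # Prefix with the index so Cat/1.jpg and Dog/1.jpg do not collide
        target = os.path.join(quarantine_dir, f"{i}_{os.path.basename(path)}")
        shutil.move(path, target)
        moved.append(target)
    return moved
```

For example, quarantine_files(bad_list, 'quarantine') would leave the train and test directories containing only the files that passed the check. The quarantine directory should sit outside the dataset root so that flow_from_directory does not pick it up as a class.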
Upvotes: 5
Reputation: 3574
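Try this script to check whether all the images are in a valid format. It stops with an error at the first file that cannot be read, and the last path printed is that file.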
import os
from PIL import Image

folder_path = r'data\img'
extensions = []
for fldr in os.listdir(folder_path):
    sub_folder_path = os.path.join(folder_path, fldr)
    for filee in os.listdir(sub_folder_path):
        file_path = os.path.join(sub_folder_path, filee)
        print('** Path: {} **'.format(file_path), end="\r", flush=True)
        # Raises an exception here if the file is not a readable image
        im = Image.open(file_path)
        rgb_im = im.convert('RGB')
        # Record each distinct file extension that was seen
        ext = os.path.splitext(filee)[1].lstrip('.')
        if ext not in extensions:
            extensions.append(ext)
Upvotes: 15