wong.lok.yin

Reputation: 889

Unknown image file format. One of JPEG, PNG, GIF, BMP required

I built a simple CNN model and it raised the following error:

Epoch 1/10
235/235 [==============================] - ETA: 0s - loss: 540.2643 - accuracy: 0.4358
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-14-ab88232c98aa> in <module>()
     15     train_ds,
     16     validation_data=val_ds,
---> 17     epochs=epochs
     18 )

7 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

InvalidArgumentError:  Unknown image file format. One of JPEG, PNG, GIF, BMP required.
     [[{{node decode_image/DecodeImage}}]]
     [[IteratorGetNext]] [Op:__inference_test_function_2924]

Function call stack:
test_function

The code I wrote is quite simple and standard. Most of it is copied directly from the official website. The error is raised before the first epoch finishes. I am pretty sure that the images are all PNG files. The train folder contains nothing but images (no text or code files). I am using Colab, and the TensorFlow version is 2.5.0. I would appreciate any help.

data_dir = './train'

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir, 
    subset='training',
    validation_split=0.2,
    batch_size=batch_size,
    seed=42
)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir, 
    subset='validation',
    validation_split=0.2,
    batch_size=batch_size,
    seed=42
)

model = Sequential([
    layers.InputLayer(input_shape=(image_size, image_size, 3)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
    ])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(
    optimizer=optimizer,
    loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)

Upvotes: 10

Views: 20694

Answers (6)

Cheesecode

Reputation: 1

Lescurel's answer saved my butt! I made a small amendment in case you want to automatically remove the images that are not usable:

from pathlib import Path
import imghdr
import os

# Define the directory containing the images
# (assumed path, taken from the question; adjust to your dataset)
extract_path = "./train"

# List of valid image extensions
image_extensions = [".png", ".jpg", ".jpeg", ".bmp", ".gif"]

# Image types accepted by TensorFlow
img_type_accepted_by_tf = ["bmp", "gif", "jpeg", "png"]

# Loop through all files in the directory and subdirectories
for filepath in Path(extract_path).rglob("*"):
    # Check if it's a file before proceeding
    if filepath.is_file():
        # Check if the file has a valid image extension
        if filepath.suffix.lower() in image_extensions:
            # Check the actual image type
            img_type = imghdr.what(filepath)
            if img_type is None:
                print(f"{filepath} is not an image. Deleting...")
                os.remove(filepath)  # Delete the file
            elif img_type not in img_type_accepted_by_tf:
                print(f"{filepath} is a {img_type}, not accepted by TensorFlow. Deleting...")
                os.remove(filepath)  # Delete the file
        else:
            # If the file does not have a valid extension
            print(f"{filepath} is not a recognized image type. Deleting...")
            os.remove(filepath)  # Delete the file
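
Deleting is irreversible, so a more cautious variant is to move the suspicious files out of the dataset instead. The following is a minimal sketch, not part of the original answer: the helper name `quarantine` and the `quarantine_root` directory are hypothetical, and the quarantine directory is assumed to live outside the data directory so that it is not picked up as a class folder.

```python
import shutil
from pathlib import Path


def quarantine(filepath, data_root, quarantine_root):
    """Move filepath out of data_root, keeping its relative sub-path.

    quarantine_root is a hypothetical directory and should be located
    outside data_root.
    """
    filepath = Path(filepath)
    # Rebuild the same class/sub-folder layout inside the quarantine directory.
    target = Path(quarantine_root) / filepath.relative_to(data_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(filepath), str(target))
    return target
```

Calling `quarantine(filepath, extract_path, "./quarantine")` in place of `os.remove(filepath)` keeps the files available for later inspection or conversion.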

Upvotes: 0

Louis Lac

Reputation: 6406

As stated in other answers, you can use the imghdr built-in Python module to guess the image format, check that the file is not corrupted, and check that it matches the file extension.

However, imghdr is deprecated as of Python 3.11 (PEP 594) and will be removed in Python 3.13, due to the limited number of formats it supports and its limited functionality.

Three alternatives are listed in the PEP: filetype, puremagic and python-magic.

Here is an example use with filetype:

from pathlib import Path
import filetype


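# Extensions reported by filetype for the image formats supported by TensorFlow
img_exts = {"png", "jpg", "gif", "bmp"}

path = Path("train")

# rglob also visits the class sub-folders used by image_dataset_from_directory
for file in path.rglob("*"):
    if file.is_dir():
        continue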

    ext = filetype.guess_extension(file)

    if ext is None:
        print(f"'{file}': extension cannot be guessed from content")
    elif ext not in img_exts:
        print(f"'{file}': not a supported image file")
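
If adding a third-party dependency is not an option, the same kind of check can be approximated with the standard library alone by comparing the first bytes of each file against the well-known signatures of the four formats. This is a minimal sketch under that assumption: the names `SIGNATURES`, `sniff_image_type` and `find_unsupported` are made up for illustration, and a matching header does not guarantee that TensorFlow can decode the rest of the file.

```python
from pathlib import Path

# Leading "magic" bytes of the four formats named in the error message.
SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
    "bmp": (b"BM",),
}


def sniff_image_type(path):
    """Return 'png', 'jpeg', 'gif' or 'bmp' based on the file header, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    for name, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return name
    return None


def find_unsupported(root):
    """Yield every regular file under root whose header matches no known signature."""
    for file in sorted(Path(root).rglob("*")):
        if file.is_file() and sniff_image_type(file) is None:
            yield file
```

For example, `for f in find_unsupported("train"): print(f)` lists the files worth inspecting, whatever their extension claims.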

Upvotes: 1

Vikash Ramaswamy

Reputation: 1

I had the same issue. I went through many of the other answers and none of them worked for me, so I wrote the training loop with a try/except block so that any batch that triggers the error is skipped. Please note: this is a workaround, not a direct solution.

max_iterations = len(preprocessed_train_dataset)
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    # Re-create the iterator every epoch; a single iterator is exhausted after one pass.
    iterator = iter(preprocessed_train_dataset)
    # Iterate over the batches of the dataset.
    for i in range(1, max_iterations + 1):
        try:
            x_batch_train, y_batch_train = next(iterator)
        except StopIteration:
            break
        except tf.errors.InvalidArgumentError:
            # The batch contains a file that could not be decoded: skip it.
            print("Skipping batch {}".format(i))
            continue

        with tf.GradientTape() as tape:
            logits = model(x_batch_train, training=True)
            loss_value = loss_fn(y_batch_train, logits)

        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        # Update the training metric so the epoch accuracy is meaningful.
        train_acc_metric.update_state(y_batch_train, logits)

        # Log every 200 batches.
        if i % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (i, float(loss_value))
            )
            print("Seen so far: %s samples" % (i * batch_size))

    # Display and reset the training metric at the end of each epoch.
    train_acc = train_acc_metric.result()
    print("Training acc over epoch: %.4f" % (float(train_acc),))
    train_acc_metric.reset_states()

    # Run a validation loop at the end of each epoch.
    for x_batch_val, y_batch_val in preprocessed_val_dataset:
        val_logits = model(x_batch_val, training=False)
        # Update val metrics
        val_acc_metric.update_state(y_batch_val, val_logits)
    val_acc = val_acc_metric.result()
    val_acc_metric.reset_states()
    print("Validation acc: %.4f" % (float(val_acc),))

# Evaluate the model
test_loss, test_accuracy = model.evaluate(preprocessed_test_dataset)

Upvotes: 0

Inuwa Mobarak

Reputation: 84

TensorFlow is strict about image formats. The script below should help you delete the bad images. Sometimes a dataset will even load fine with, for instance, Torch but raise a format error with TF. Nonetheless, it is best practice to always preprocess the images to ensure a robust, safe and standard model.

from pathlib import Path
import os

img_link = list(Path("/home/user/datasets/samples/").glob(r'**/*.jpg'))

count_num = 0
for lnk in img_link:
    # JFIF (JPEG File Interchange Format) is the marker used here to decide
    # whether a .jpg file is a well-formed JPEG.
    # Note: JPEGs that start with an Exif header have no JFIF marker and are deleted too.
    with open(lnk, 'rb') as binary_img:
        find_img = b'JFIF' in binary_img.peek(10)
    if not find_img:
        count_num += 1
        os.remove(str(lnk))
print('Total %d images deleted from the dataset' % count_num)
# This should help you delete the badly encoded images.
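
The JFIF test only looks at the first bytes of the file, so it does not notice a download that was cut off halfway. As a complementary heuristic, one can check that the file starts with the JPEG start-of-image marker (`FF D8`) and ends with the end-of-image marker (`FF D9`). This is a rough sketch with a hypothetical helper name, `jpeg_looks_complete`; some valid JPEGs carry extra bytes after the end marker and would be reported as incomplete, so treat a negative result as a hint to inspect the file, not as proof of corruption.

```python
def jpeg_looks_complete(path):
    """Heuristic: True if the file starts with the SOI marker and ends with EOI."""
    with open(path, "rb") as f:
        data = f.read()
    # SOI = FF D8 at the very start, EOI = FF D9 at the very end.
    return data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")
```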

Upvotes: 3

Anass Maourid

Reputation: 31

Reading the file and decoding it with the format-specific function should work fine; the same applies to the other supported types. For example, for PNG:

image = tf.io.read_file("im.png")
image = tf.image.decode_png(image, channels=3)

Upvotes: 1

Lescurel

Reputation: 11631

Some of the files in your validation folder are not in a format accepted by TensorFlow (JPEG, PNG, GIF, BMP), or may be corrupted. A file's extension is indicative only and does not enforce anything about the content of the file.

You might be able to find the culprit using the imghdr module from the Python standard library and a simple loop.

from pathlib import Path
import imghdr

data_dir = "/home/user/datasets/samples/"
image_extensions = [".png", ".jpg"]  # add all your image file extensions here

img_type_accepted_by_tf = ["bmp", "gif", "jpeg", "png"]
for filepath in Path(data_dir).rglob("*"):
    if filepath.suffix.lower() in image_extensions:
        img_type = imghdr.what(filepath)
        if img_type is None:
            print(f"{filepath} is not an image")
        elif img_type not in img_type_accepted_by_tf:
            print(f"{filepath} is a {img_type}, not accepted by TensorFlow")

This should print out whether you have files that are not images, or that are not what their extension says they are, and not accepted by TF. Then you can either get rid of them or convert them to a format that TensorFlow supports.

Upvotes: 27
