Nikhil Raghavendra

Reputation: 1640

EOFError: Compressed file ended before the end-of-stream marker was reached - MNIST data set

I am getting the following error when I run mnist = input_data.read_data_sets("MNIST_data", one_hot = True).

EOFError: Compressed file ended before the end-of-stream marker was reached

Even when I extract the file manually and place it in the MNIST_data directory, the program is still trying to download the file instead of using the extracted file.

When I extract the file manually using WinZip, WinZip tells me that the file is corrupt.

How do I solve this problem?

I can't even load the data set yet, and I still have to debug the program itself. Please help.

I installed TensorFlow with pip, so I don't have the TensorFlow examples. I went to GitHub to get the input_data file and saved it in the same directory as my main.py. The error only concerns the .gz file: the program could not extract it.

runfile('C:/Users/Nikhil/Desktop/Tensor Flow/tensf.py', wdir='C:/Users/Nikhil/Desktop/Tensor Flow')
Reloaded modules: input_data
Extracting MNIST_data/train-images-idx3-ubyte.gz
C:\Users\Nikhil\Anaconda3\lib\gzip.py:274: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
  return self._buffer.read(size)
Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('C:/Users/Nikhil/Desktop/Tensor Flow/tensf.py', wdir='C:/Users/Nikhil/Desktop/Tensor Flow')

  File "C:\Users\Nikhil\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\Nikhil\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/Nikhil/Desktop/Tensor Flow/tensf.py", line 26, in <module>
    mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)

  File "C:\Users\Nikhil\Desktop\Tensor Flow\input_data.py", line 181, in read_data_sets
    train_images = extract_images(local_file)

  File "C:\Users\Nikhil\Desktop\Tensor Flow\input_data.py", line 60, in extract_images
    buf = bytestream.read(rows * cols * num_images)

  File "C:\Users\Nikhil\Anaconda3\lib\gzip.py", line 274, in read
    return self._buffer.read(size)

  File "C:\Users\Nikhil\Anaconda3\lib\_compression.py", line 68, in readinto
    data = self.read(len(byte_view))

  File "C:\Users\Nikhil\Anaconda3\lib\gzip.py", line 480, in read
    raise EOFError("Compressed file ended before the "

EOFError: Compressed file ended before the end-of-stream marker was reached

Upvotes: 21

Views: 88933

Answers (9)

because_im_batman

Reputation: 1113

I couldn't find the Keras dataset download folder mentioned in other answers on my Linux machine.

So I found a somewhat hacky but easy fix for this problem. It turns out there's a built-in way to force the mnist library to download the files again.

  1. Go to your pip-installed copy of the mnist library. It can be found inside your Python virtual environment:
venv/lib/python3.10/site-packages/mnist/__init__.py

  2. Now look for every force=False in this file and change it to force=True.

Here's the file after the update:

import os
import functools
import operator
import gzip
import struct
import array
import tempfile
try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve  # py2
try:
    from urllib.parse import urljoin
except ImportError:
    from urlparse import urljoin
import numpy


__version__ = '0.2.2'


# `datasets_url` and `temporary_dir` can be set by the user using:
# >>> mnist.datasets_url = 'http://my.mnist.url'
# >>> mnist.temporary_dir = lambda: '/tmp/mnist'
datasets_url = 'http://yann.lecun.com/exdb/mnist/'
temporary_dir = tempfile.gettempdir


class IdxDecodeError(ValueError):
    """Raised when an invalid idx file is parsed."""
    pass


def download_file(fname, target_dir=None, force=True):
    """Download fname from the datasets_url, and save it to target_dir,
    unless the file already exists, and force is False.

    Parameters
    ----------
    fname : str
        Name of the file to download

    target_dir : str
        Directory where to store the file

    force : bool
        Force downloading the file, if it already exists

    Returns
    -------
    fname : str
        Full path of the downloaded file
    """
    target_dir = target_dir or temporary_dir()
    target_fname = os.path.join(target_dir, fname)

    if force or not os.path.isfile(target_fname):
        url = urljoin(datasets_url, fname)
        urlretrieve(url, target_fname)

    return target_fname


def parse_idx(fd):
    """Parse an IDX file, and return it as a numpy array.

    Parameters
    ----------
    fd : file
        File descriptor of the IDX file to parse

    endian : str
        Byte order of the IDX file. See [1] for available options

    Returns
    -------
    data : numpy.ndarray
        Numpy array with the dimensions and the data in the IDX file

    1. https://docs.python.org/3/library/struct.html
        #byte-order-size-and-alignment
    """
    DATA_TYPES = {0x08: 'B',  # unsigned byte
                  0x09: 'b',  # signed byte
                  0x0b: 'h',  # short (2 bytes)
                  0x0c: 'i',  # int (4 bytes)
                  0x0d: 'f',  # float (4 bytes)
                  0x0e: 'd'}  # double (8 bytes)

    header = fd.read(4)
    if len(header) != 4:
        raise IdxDecodeError('Invalid IDX file, '
                             'file empty or does not contain a full header.')

    zeros, data_type, num_dimensions = struct.unpack('>HBB', header)

    if zeros != 0:
        raise IdxDecodeError('Invalid IDX file, '
                             'file must start with two zero bytes. '
                             'Found 0x%02x' % zeros)

    try:
        data_type = DATA_TYPES[data_type]
    except KeyError:
        raise IdxDecodeError('Unknown data type '
                             '0x%02x in IDX file' % data_type)

    dimension_sizes = struct.unpack('>' + 'I' * num_dimensions,
                                    fd.read(4 * num_dimensions))

    data = array.array(data_type, fd.read())
    data.byteswap()  # looks like array.array reads data as little endian

    expected_items = functools.reduce(operator.mul, dimension_sizes)
    if len(data) != expected_items:
        raise IdxDecodeError('IDX file has wrong number of items. '
                             'Expected: %d. Found: %d' % (expected_items,
                                                          len(data)))

    return numpy.array(data).reshape(dimension_sizes)


def download_and_parse_mnist_file(fname, target_dir=None, force=True):
    """Download the IDX file named fname from the URL specified in dataset_url
    and return it as a numpy array.

    Parameters
    ----------
    fname : str
        File name to download and parse

    target_dir : str
        Directory where to store the file

    force : bool
        Force downloading the file, if it already exists

    Returns
    -------
    data : numpy.ndarray
        Numpy array with the dimensions and the data in the IDX file
    """

    fname = download_file(fname, target_dir=target_dir, force=force)
    fopen = gzip.open if os.path.splitext(fname)[1] == '.gz' else open
    with fopen(fname, 'rb') as fd:
        return parse_idx(fd)


def train_images():
    """Return train images from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    train_images : numpy.ndarray
        Numpy array with the images in the train MNIST database. The first
        dimension indexes each sample, while the other two index rows and
        columns of the image
    """
    return download_and_parse_mnist_file('train-images-idx3-ubyte.gz')


def test_images():
    """Return test images from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    test_images : numpy.ndarray
        Numpy array with the images in the train MNIST database. The first
        dimension indexes each sample, while the other two index rows and
        columns of the image
    """
    return download_and_parse_mnist_file('t10k-images-idx3-ubyte.gz')


def train_labels():
    """Return train labels from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    train_labels : numpy.ndarray
        Numpy array with the labels 0 to 9 in the train MNIST database.
    """
    return download_and_parse_mnist_file('train-labels-idx1-ubyte.gz')


def test_labels():
    """Return test labels from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    test_labels : numpy.ndarray
        Numpy array with the labels 0 to 9 in the train MNIST database.
    """
    return download_and_parse_mnist_file('t10k-labels-idx1-ubyte.gz')

  3. For efficiency, you can set them back to force=False once you no longer need to download the files again (and no longer hit this issue). The datasets take only about a second to download anyway, so it shouldn't ever be a big issue.
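
If editing files under site-packages feels too invasive, deleting the cached archives should have a similar effect, because the `if force or not os.path.isfile(target_fname)` check in the file above then triggers a fresh download. A minimal sketch, assuming the default `temporary_dir` shown above (`tempfile.gettempdir`) is in use; `clear_cached_mnist` is a hypothetical helper name and not part of the mnist package:

```python
import os
import tempfile

# File names taken from the library source shown above.
MNIST_FILES = (
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
)


def clear_cached_mnist(cache_dir=None):
    """Delete cached MNIST archives and return the paths that were removed.

    cache_dir defaults to the system temp directory, which is an
    assumption based on the library's default temporary_dir.
    """
    cache_dir = cache_dir or tempfile.gettempdir()
    removed = []
    for name in MNIST_FILES:
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Calling `clear_cached_mnist()` once before `mnist.train_images()` should make the library fetch the files again while leaving force=False untouched.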

Upvotes: 1

SHRIRAJ CHAUHAN

Reputation: 1

I had the same issue. First you have to download the dataset using the lines of code below (I am using PyCharm):

import tensorflow

name = tensorflow.keras.datasets.fashion_mnist
name.load_data()

Run this first; it will download the data. Then you can load it using the code below:

import tensorflow

name = tensorflow.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = name.load_data()

Upvotes: 0

AAYUSH SHAH

Reputation: 131

It is very simple on Windows:

Go to C:\Users\Username\.keras\datasets

and then delete the dataset that you want to re-download or that causes the error.

Upvotes: 5

Benson Mathew

Reputation: 155

I had the same issue when downloading datasets using torchvision on Windows. I was able to fix this by deleting all files from the following path: C:\Users\UserName\MNIST\raw

Upvotes: 1

Rajnish suryavanshi

Reputation: 3434

This happens when a dataset download is started but, for some reason, does not complete. For anyone struggling with this on Windows when working with PyTorch: I resolved the same issue by deleting the folder that resides at the path below.

C:/Users/UserName/.pytorch/foldername

Also note that in your case .pytorch may not be visible if the display of hidden files is disabled.
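
Hidden folders that the file explorer does not show are still visible to Python, so a short script can report which cache folders exist. A rough sketch: the candidate folder names are simply the ones mentioned in the answers on this page, and `find_dataset_caches` is a hypothetical helper, not an API of Keras or PyTorch. The real location depends on the library version and on any root directory you passed yourself.

```python
from pathlib import Path

# Folder names as mentioned in the answers on this page (assumptions,
# not guaranteed defaults of any library).
CANDIDATES = (
    Path(".keras") / "datasets",
    Path(".pytorch"),
    Path("MNIST") / "raw",
)


def find_dataset_caches(home=None):
    """Return the candidate cache directories that exist under home."""
    home = Path(home) if home is not None else Path.home()
    return [home / rel for rel in CANDIDATES if (home / rel).is_dir()]
```

Printing the result of `find_dataset_caches()` lists the folders worth inspecting before deleting anything.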

Upvotes: 1

Nayan Barde

Reputation: 41

First, remove the partially downloaded fashion_mnist directory from the Keras directory.

After that, download the files from GitHub:

https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/train-labels-idx1-ubyte.gz

Place those files and the extracted files in the fashion_mnist directory in the Keras folder.

This will solve your problem.

Upvotes: 1

Krackle

Reputation: 373

To anyone struggling: I had a similar issue on macOS Mojave 10.14.3 while taking a Udemy class using Anaconda and Jupyter, and used the following to fix it. In Finder, choose Go > Go to Folder, enter ~/.keras/datasets/fashion_mnist in the Go to Folder window, and delete the partially downloaded files.

Go to GitHub and search for fashion-mnist-master from https://github.com/zalandoresearch/fashion-mnist.git

Download the file, locate the data > fashion folder, and unzip the four files.

Place the four unzipped files into ~/.keras/datasets/fashion_mnist

Open Jupyter Lab and, in a new page, insert the following:

from keras.datasets import fashion_mnist

#message states using TensorFlow backend

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

#it will then cycle through as if the download were successful  

Good luck, and may the odds be in your favor.

Upvotes: 2

Pikamander2

Reputation: 8319

If the download gets interrupted, delete the C:/tmp/imagenet folder and restart the download.

Also, for people who get here via Google, run the classify_image.py file via the command line instead of using IDLE:

python classify_image.py

Upvotes: 1

Payas Pandey

Reputation: 393

This is because, for some reason, you have an incomplete download of the MNIST dataset.

You will have to manually delete the downloaded folder, which usually resides in ~/.keras/datasets or in any path you specified yourself, in your case MNIST_data.

Perform the following steps in the terminal (Ctrl + Alt + T):

  1. cd ~/.keras/datasets/
  2. rm -rf "dataset name"

You should be good to go!
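
To confirm that an archive really is truncated before deleting anything, you can try to decompress it from start to end. A minimal sketch using only the standard library; `is_gzip_intact` and `remove_broken_archives` are hypothetical helper names, and treating a zero-byte file as broken is an assumption that fits a failed download:

```python
import gzip
import zlib
from pathlib import Path


def is_gzip_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at path can be decompressed."""
    path = Path(path)
    if path.stat().st_size == 0:
        # An empty file is what an aborted download often leaves behind.
        return False
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end; a truncated file raises EOFError here.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True


def remove_broken_archives(directory):
    """Delete unreadable .gz files in directory and return their names."""
    removed = []
    for path in sorted(Path(directory).glob("*.gz")):
        if not is_gzip_intact(path):
            path.unlink()
            removed.append(path.name)
    return removed
```

For the question's setup, `remove_broken_archives("MNIST_data")` would drop only the damaged archives, so the next run downloads just those files again.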

Upvotes: 26
