Reputation: 4797

How to use datasets.fetch_mldata() in sklearn?

I am trying to run the following code for a brief machine learning algorithm:

import re
import argparse
import csv
from collections import Counter
from sklearn import datasets
import sklearn
from sklearn.datasets import fetch_mldata

dataDict = datasets.fetch_mldata('MNIST Original')

In this piece of code, I am trying to read the dataset 'MNIST Original' present at mldata.org via sklearn. This results in the following error(there are more lines of code but I am getting error at this particular line):

Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1481, in <module>
    debugger.run(setup['file'], None, None)
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1124, in run
    pydev_imports.execfile(file, globals, locals) #execute the script
  File "C:/Users/sony/PycharmProjects/Machine_Learning_Homework1/zeroR.py", line 131, in <module>
    dataDict = datasets.fetch_mldata('MNIST Original')
  File "C:\Anaconda\lib\site-packages\sklearn\datasets\mldata.py", line 157, in fetch_mldata
    matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio.py", line 176, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 294, in get_variables
    res = self.read_var_array(hdr, process)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 257, in read_var_array
    return self._matrix_reader.array_from_header(header, process)
  File "mio5_utils.pyx", line 624, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5717)
  File "mio5_utils.pyx", line 653, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5147)
  File "mio5_utils.pyx", line 721, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex (scipy\io\matlab\mio5_utils.c:6134)
  File "mio5_utils.pyx", line 424, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric (scipy\io\matlab\mio5_utils.c:3704)
  File "mio5_utils.pyx", line 360, in scipy.io.matlab.mio5_utils.VarReader5.read_element (scipy\io\matlab\mio5_utils.c:3429)
  File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711)
IOError: could not read bytes

I have tried researching on internet but there is hardly any help available. Any expert help related to solving this error will be much appreciated.

TIA.

Upvotes: 18

Answers (11)

skovorodkin

Reputation: 10294

As of version 0.20, sklearn deprecates fetch_mldata function and adds fetch_openml instead.

Download MNIST dataset with the following code:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

There are some changes to the format though. For instance, mnist['target'] is an array of string category labels (not floats as before).

Upvotes: 32

Soundous Bahri

Reputation: 106

I downloaded the dataset from this link

https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat

then I typed these lines

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', transpose_data=True, data_home='files')

*** the path is (your working directory)/files/mldata/mnist-original.mat

I hope you get it , it worked well for me

Upvotes: 5

Bal Krishna Jha

Reputation: 7286

Apart from what @szymon has mentioned you can alternatively load dataset using:

from six.moves import urllib
from sklearn.datasets import fetch_mldata

from scipy.io import loadmat
mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}

Upvotes: 0

Thang Tran

Reputation: 1

I also had this problem in the past. It is due to the dataset is quite large (about 55.4 mb), I run the "fetch_mldata" but because of the internet connection, it took awhile to download them all. I did not know and interrupt the process.

The dataset is corrupted and that why the error happened.

Upvotes: 0

YH Hsu

Reputation: 11

I experienced the same issue and found different file size of mnist-original.mat at different times while I use my poor WiFi. I switched to LAN and it works fine. It maybe the issue of networking.

Upvotes: 1

mcolak

Reputation: 637

If you didn't give the data_home, program look the ${yourprojectpath}/mldata/minist-original.mat you can download the program and put the file the correct path

Upvotes: 0

Victoria Stuart

Reputation: 5102

I was also getting a fetch_mldata() "IOError: could not read bytes" error. Here is the solution; the relevant lines of code are

from sklearn.datasets.mldata import fetch_mldata
mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')

... be sure to change 'data_home' for your preferred location (directory).

Here is a script:

#!/usr/bin/python
# coding: utf-8

# Source:
# https://stackoverflow.com/questions/19530383/how-to-use-datasets-fetch-mldata-in-sklearn
# ... modified, below, by Victoria

"""
pers. comm. (Jan 27, 2016) from MLdata.org MNIST dataset contactee "Cheng Ong:"

    The MNIST data is called 'mnist-original'. The string you pass to sklearn
    has to match the name of the URL:

    from sklearn.datasets.mldata import fetch_mldata
    data = fetch_mldata('mnist-original')
"""

def get_data():

    """
    Get MNIST data; returns a dict with keys 'train' and 'test'.
    Both have the keys 'X' (features) and 'y' (labels)
    """

    from sklearn.datasets.mldata import fetch_mldata

    mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')

    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1]
    x = x/255.0*2 - 1

    from sklearn.cross_validation import train_test_split

    x_train, x_test, y_train, y_test = train_test_split(x, y,
        test_size=0.33, random_state=42)

    data = {'train': {'X': x_train, 'y': y_train},
            'test': {'X': x_test, 'y': y_test}}

    return data

data = get_data()
print '\n', data, '\n'

Upvotes: 0

Martin Thoma

Reputation: 136855

Here is some sample code how to get MNIST data ready to use for sklearn:

def get_data():
    """
    Get MNIST data ready to learn with.

    Returns
    -------
    dict
        With keys 'train' and 'test'. Both do have the keys 'X' (features)
        and'y' (labels)
    """
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')

    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1] - This is of mayor importance!!!
    x = x/255.0*2 - 1

    from sklearn.cross_validation import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train,
                      'y': y_train},
            'test': {'X': x_test,
                     'y': y_test}}
    return data

Upvotes: 1

Szymon Laszczyński

Reputation: 101

Looks like the cached data are corrupted. Try removing them and download again (it takes a moment). If not specified differently the data for 'MINST original' should be in

~/scikit_learn_data/mldata/mnist-original.mat

Upvotes: 10

Brent

Reputation: 729

Try it like this:

dataDict = fetch_mldata('MNIST original')

This worked for me. Since you used the from ... import ... syntax, you shouldn't prepend datasets when you use it

Upvotes: 0

Lucas Ribeiro

Reputation: 6292

That's 'MNIST original'. With a lowercase on "o".

Upvotes: -1

How to use datasets.fetch_mldata() in sklearn?

Answers (11)

Related Questions