Simon

Reputation: 5698

Sklearn Multiclass Dataset Loading

I am using Scikit-Learn for a multiclass problem, but I can find very few examples of how to load a custom dataset with multiple classes. The sklearn.datasets.load_files method does not seem suitable, because a file that belongs to several classes would have to be stored multiple times. I currently have the following structure:

X => Python list with lists of features (in text).

y => Python list with lists of classes (in text).

How do I transform this to a structure Scikit-Learn can use in a classifier?

Upvotes: 1

Views: 436

Answers (1)

Guiem Bosch

Reputation: 2758

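    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer

    # Features: one sample per row, one feature per column.
    X = np.loadtxt('samples.csv', delimiter=",")
    # Labels: one row per sample, listing the classes it belongs to.
    y_aux = np.loadtxt('targets.csv', delimiter=",", dtype=int)
    # Encode the label rows as a binary indicator matrix (one column per class).
    y = MultiLabelBinarizer().fit_transform(y_aux)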

Code explanation: Suppose all your features are stored in a file called samples.csv and the class labels in another file called targets.csv (they could of course be stored in the same file, in which case you would just need to split the columns). For clarity, in this example my files contain:

  • samples.csv
    4.0,3.2,5.5
    6.8,5.6,3.3
  • targets.csv
    1,4 <-- sample one belongs to classes 1 and 4
    2,3 <-- sample two belongs to classes 2,3
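If features and labels live in the same file, splitting the columns could look roughly like the sketch below. The combined layout (three feature columns followed by two label columns) and the names combined_csv and n_features are assumptions made for illustration; an in-memory buffer stands in for the real file.

```python
import io
import numpy as np

# Hypothetical combined file: three feature columns, then two label columns.
combined_csv = io.StringIO(
    "4.0,3.2,5.5,1,4\n"
    "6.8,5.6,3.3,2,3\n"
)

data = np.loadtxt(combined_csv, delimiter=",")
n_features = 3  # assumption: the first three columns hold the features

X = data[:, :n_features]                  # feature columns
y_aux = data[:, n_features:].astype(int)  # remaining columns are class labels
```

The resulting y_aux can then be passed to MultiLabelBinarizer in the same way as the labels loaded from targets.csv.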

MultiLabelBinarizer encodes the targets as a binary indicator matrix with one column per class (here classes 1 to 4, in sorted order), so that y is ready to be fed into classifiers that accept several labels per sample. The output of the code is:

    y = array([[1, 0, 0, 1],
               [0, 1, 1, 0]])

meaning sample one belongs to classes 1 and 4 and sample two belongs to 2 and 3.
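For the structure described in the question (Python lists of text features and text classes, already in memory), a similar approach might work without any files. The sketch below is only one possible reading of that structure: the sample data is invented, and treating each feature list as a set of tokens encoded with a second MultiLabelBinarizer is an assumption, not something the question confirms. OneVsRestClassifier with LogisticRegression is just one example of an estimator that accepts an indicator matrix as y.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical in-memory data: each sample is a list of text features,
# and each target is a list of text class names.
X_raw = [["red", "round"], ["yellow", "long"], ["red", "long"], ["yellow", "round"]]
y_raw = [["fruit", "sweet"], ["fruit"], ["vegetable"], ["fruit", "sweet"]]

# Assumption: each feature list is treated as a set of tokens,
# so it can be one-hot encoded the same way as the labels.
feature_encoder = MultiLabelBinarizer()
label_encoder = MultiLabelBinarizer()
X = feature_encoder.fit_transform(X_raw)
y = label_encoder.fit_transform(y_raw)

# Train one binary classifier per class on the indicator matrix.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Map the predicted indicator rows back to tuples of text class names.
predicted = label_encoder.inverse_transform(clf.predict(X))
```

Keeping the fitted label_encoder around is what allows predictions to be translated back into the original class names.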

Upvotes: 1
