Reputation: 21
I'm learning machine learning using the iris dataset on Python 3.6 with sklearn, and I don't understand where the class names that are being retrieved are stored. In Iris, there are 3 classes, and each class contains 50 observations. You can use several commands to print the classes, and their associated numerical values:
print(iris.target)
print(iris.target_names)
This will result in the output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']
So as can be seen, the classes are Setosa, Versicolor, and Virginica. What I don't understand is where these class names are being stored, or how they're called upon within the model. If you use the shape command on the data, or target, the result is (150,4) and (150,) meaning there is 150 observations and 4 rows in the data, and 150 rows in the target. I am just not able to bridge the gap with my mind as to where this is coming from, however.
What I don't understand is where the class names are supposed to be stored. If I made a brand new dataset for pokemon types and had ice, fire, water, flying, where could I store these types? Would they be required to be numerical as well, like iris, with 0,1,2,3?
Upvotes: 1
Views: 3421
Reputation: 1095
Sklearn uses a custom type of object to store its datasets, exactly so that they can store metadata along with the raw data.
If you load the iris dataset
In [2]: from sklearn import datasets
In [3]: iris = datasets.load_iris()
You can inspect the type of object with type
:
In [4]: type(iris)
Out[4]: sklearn.utils.Bunch
You can look at the attributes inside the object with dir
:
In [5]: dir(iris)
Out[5]: ['DESCR', 'data', 'feature_names', 'target', 'target_names']
And then use .
notation to take a look at the attributes themselves:
In [6]: type(iris.data)
Out[6]: numpy.ndarray
In [7]: type(iris.target)
Out[7]: numpy.ndarray
In [8]: type(iris.feature_names)
Out[8]: list
If you want to mimic this for your own datasets, you will have to define your own custom object type to mimic this structure. That would involve defining your own class.
Upvotes: 2