Reputation: 119
I'm new to sklearn and Python. I have this code snippet from a project that I'm trying to decipher, and I hope you can help me with it.
from repository import Repository
from configuration import config
repository = Repository(config)
dataset, labels = repository.get_dataset_and_labels()
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit
from sklearn.grid_search import GridSearchCV
# Ensure that there are no NaNs
dataset = dataset.fillna(-85)
# Split the dataset into training (90%) and testing (10%)
X_train, X_test, y_train, y_test = train_test_split(dataset, labels, test_size = 0.1 )
cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0)
# Define the classifier to use
estimator = SVC(kernel='linear')
# Define parameter space
gammas = np.logspace(-6, -1, 10)
# Use the training dataset and cross-validation to find the best hyper-parameters.
classifier = GridSearchCV(estimator=estimator, cv=cv, param_grid=dict(gamma=gammas))
classifier.fit(X_train, [repository.locations.keys().index(tuple(l)) for l in y_train])
What I can't wrap my head around is the use of the classifier's fit method. In all the examples I found online, fit receives the training data and the corresponding labels. In the example above, fit receives the training data and the indexes of the labels (not the labels themselves). How is it that the classifier takes the indexes rather than the labels and still works?
Upvotes: 0
Views: 652
Reputation: 3023
Obviously your labels are encoded here:
[repository.locations.keys().index(tuple(l)) for l in y_train]
Apart from this, I think it is worth looking at the GridSearchCV documentation.
Upvotes: -1
Reputation: 66775
A label is just an abstract term. It can be anything: a word, a number, an index. In your case (whatever repository.locations.keys().index(...) is; let us just assume that it is a deterministic function and, for simplicity, call it f), you create the list
[f(tuple(l)) for l in y_train]
y_train itself is a list (or, more generally, an iterable), so the above is also a list of labels, simply transformed through f for some other reason (maybe in this particular case the user simply needs a different set of labels than in the original dataset). Either way, you still pass labels to your fit method; they are simply transformed.
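To make the transformation f concrete, here is a minimal sketch. The `locations` dictionary and the values in `y_train` are hypothetical stand-ins for `repository.locations` and the real labels, which the question does not show. Note that `dict.keys().index(...)` only works on Python 2, where `keys()` returns a list; on Python 3 the keys would have to be wrapped in `list(...)` first, as assumed below.

```python
# Hypothetical stand-in for repository.locations (not shown in the question):
# the keys are coordinate tuples, the values are irrelevant here.
locations = {(0, 0): 'kitchen', (0, 5): 'hall', (3, 5): 'office'}

# On Python 3, keys() is a view without .index, so make it a list first.
keys = list(locations.keys())

def f(label):
    """Map a label (any sequence matching a key) to its position in keys."""
    return keys.index(tuple(label))

# Hypothetical raw labels, one per training sample.
y_train = [[0, 5], [0, 0], [3, 5], [0, 5]]

# Same information as y_train, encoded as integers.
encoded = [f(l) for l in y_train]

# The encoding is reversible, so nothing is lost by training on indices.
decoded = [keys[i] for i in encoded]
```

Because f is deterministic and one-to-one over the keys, every index identifies exactly one original label, which is all a classifier needs.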
Consider, for example, the set of labels ['cat', 'dog']. It does not really matter whether I train a model on [x1, x2, x3] with ['cat', 'cat', 'dog'] or on [x1, x2, x3] with [0, 0, 1] (the indices of the labels).
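The cat/dog point can be checked with a small sketch. The toy data below is made up for illustration; only `SVC`, `fit`, and `predict` come from scikit-learn. Two classifiers are trained on the same samples, one with the string labels and one with their indices, and their predictions agree once the indices are mapped back to names.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy data: two well-separated clusters of 2-D points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])

names = ['cat', 'dog']
y_names = ['cat', 'cat', 'dog', 'dog']
# Same labels, expressed as indices into names.
y_index = [names.index(label) for label in y_names]

# Train one model per label representation.
clf_names = SVC(kernel='linear').fit(X, y_names)
clf_index = SVC(kernel='linear').fit(X, y_index)

# Predict on new points and translate the index predictions back to names.
X_new = np.array([[0.1, 0.0], [1.0, 0.9]])
pred_names = list(clf_names.predict(X_new))
pred_index = [names[i] for i in clf_index.predict(X_new)]
```

Here `pred_names` and `pred_index` contain the same labels. The classifier only cares which samples share a label, not how that label is spelled.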
Upvotes: 2