Reputation: 129
I am trying to do cross-validation with the ROC metric to evaluate a classifier, and I came across the following code from scikit-learn:
from sklearn import datasets

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
I have trouble understanding the X, y = X[y != 2], y[y != 2] line. What is its purpose?
Also, can someone help me clarify the use of the underscore in
n_samples, n_features
?
Thanks!
Upvotes: 1
Views: 35
Reputation: 1415
The Iris dataset has three classes labeled 0, 1, and 2.
When you see
X, y = X[y != 2], y[y != 2]
it just means that the new X and y will not contain any records of the class labeled 2.
Here is how it works.
y != 2
returns a boolean vector of the same length as y that contains True where y is 0 or 1 and False where y is 2, e.g. [True, False, False, ...]. It is also sometimes called a mask.
y[y != 2]
is boolean-based indexing. It returns a new array consisting of the elements of y that are not 2, i.e. the resulting array will not contain any 2s.
Finally, X[y != 2]
returns a new array containing only the rows of X that correspond to True values in the mask.
Since X and y are of the same length, applying the same mask to both works perfectly, and in this case it effectively removes all records with class label 2.
As for why an entire class is removed from the dataset, that is something you should look for in the tutorial you were reading.
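The steps above can be tried on a small example. This is only a sketch: the toy arrays below are made-up stand-ins for iris.data and iris.target, and the names mask, X_binary and y_binary are chosen for illustration.

```python
import numpy as np

# Toy stand-ins for iris.data and iris.target (hypothetical values)
X = np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3], [4.9, 3.0]])
y = np.array([0, 1, 2, 0])

# Boolean mask: True where the label is not 2
mask = y != 2

# Apply the same mask to both arrays to drop the class-2 rows
X_binary, y_binary = X[mask], y[mask]

print(mask)            # which rows are kept
print(y_binary)        # labels without class 2
print(X_binary.shape)  # rows left, columns unchanged
```

Here the third row has label 2, so it is dropped from both X and y, and the remaining rows stay aligned.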
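X.shape returns a tuple with the number of rows and the number of columns in the array X. Data scientists call these samples and features, so the tuple is unpacked into two variables named n_samples and n_features. The underscore is just part of each variable name.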
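As a rough illustration of the unpacking, assume an array with the same shape the Iris data has after class 2 is dropped (100 rows, 4 columns). The zeros array here is only a placeholder, not the real data.

```python
import numpy as np

# Placeholder array with the shape of the filtered Iris data (assumed: 100 x 4)
X = np.zeros((100, 4))

# X.shape is the tuple (100, 4); tuple unpacking assigns one value per name
n_samples, n_features = X.shape

print(n_samples)   # number of rows
print(n_features)  # number of columns
```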
Upvotes: 2