WY G

Reputation: 129

Measure classifier by using cross validation with ROC metrics

I am trying to use cross-validation with the ROC metric to evaluate a classifier, and I came across the following code from scikit-learn:

from sklearn import datasets

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape

I have trouble understanding the line X, y = X[y != 2], y[y != 2]. What is the purpose of this line?

Also, can someone help clarify what the underscore names n_samples, n_features are used for?

Thanks!

Upvotes: 1

Views: 35

Answers (1)

Artem Trunov

Reputation: 1415

The Iris dataset has three classes, labeled 0, 1 and 2. When you see X, y = X[y != 2], y[y != 2], it just means that the new X and y will not contain the records for the class labeled 2.

Here is how it works. y != 2 returns a boolean array of the same length as y that contains True where y is 0 or 1 and False where y is 2, e.g. [True, True, False, ...]. Such an array is often called a mask.

y[y != 2] is boolean indexing: it returns a new array consisting of the elements of y that are not 2, so the resulting array contains no 2s.

Finally, X[y != 2] returns a new array with the rows of X that correspond to the True values of the mask.

Since X and y are of the same length (X has one row per label in y), applying the same mask to both works perfectly, and in this case it effectively removes all records with class label 2.
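The masking steps above can be tried on a tiny example. This is only a sketch: the arrays below are made-up stand-ins for iris.data and iris.target, and the names mask, X_bin and y_bin are chosen here for illustration.

```python
import numpy as np

# Toy stand-ins for iris.data and iris.target (values made up for illustration)
X = np.array([[5.1, 3.5],
              [7.0, 3.2],
              [6.3, 3.3],
              [4.9, 3.0]])
y = np.array([0, 1, 2, 0])

mask = y != 2                     # boolean array: True where the label is not 2
X_bin, y_bin = X[mask], y[mask]   # the same mask selects matching rows and labels

print(mask)          # [ True  True False  True]
print(y_bin)         # [0 1 0]
print(X_bin.shape)   # (3, 2)
```

The third row, whose label is 2, is dropped from both arrays, so rows and labels stay aligned.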

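As for the purpose of removing an entire class from the dataset, this is something you should look for in the tutorial you were reading. A likely reason is that a standard ROC curve is defined for binary classification, so the example keeps only two classes.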

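X.shape returns a tuple with the number of rows and the number of columns of the array (X is a NumPy array here, not a dataframe). These are what data scientists call samples and features. n_samples, n_features = X.shape unpacks that tuple into two variables; the underscore is just part of the variable names.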
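A small sketch of the shape unpacking, assuming an array of zeros with the same shape the iris data has once class 2 is removed (100 remaining samples, 4 features); the zeros are just a placeholder for the real measurements.

```python
import numpy as np

# Placeholder array shaped like the filtered iris data: 100 rows, 4 columns
X = np.zeros((100, 4))

# X.shape is the tuple (100, 4); tuple unpacking assigns each element to a name
n_samples, n_features = X.shape

print(n_samples, n_features)   # 100 4
```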

Upvotes: 2
