dkapitan

Reputation: 931

What does y, _ assignment do in python / sklearn?

As a relative newcomer to Python, I am trying to use the sklearn RandomForestClassifier. One example from a how-to guide by yhat is the following:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)  # pd.Factor was removed from pandas
df.head()

train, test = df[df['is_train']==True], df[df['is_train']==False]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['species']) # assignment I don't understand
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

Can someone explain what the y, _ assignment does and how it works? The _ isn't used explicitly, but I get an error if I leave it out.

Upvotes: 5

Views: 413

Answers (2)

freakish

Reputation: 56517

It means that pd.factorize(train['species']) returns an iterable of exactly two items (a tuple, list, generator, or similar). In Python you can do

x, y = [1, 2]

and now x == 1 and y == 2. In your case y is bound to the first value and _ to the second. The underscore _ is often used as a name for a variable that is not going to be used.

Upvotes: 3

Juri Robl

Reputation: 5746

You decompose the returned tuple into two distinct values, y and _.

_ is a convention for "I don't need this value".

It's basically the same as:

y = pd.factorize(train['species'])[0]

with the exception that this code would work for any indexable return value with at least one element, while yours needs exactly two items in the returned value.
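The difference between the two forms can be sketched as follows. The helper names `first_by_index` and `first_by_unpacking` are made up for this illustration:

```python
def first_by_index(result):
    # Works for any indexable value with at least one element.
    return result[0]


def first_by_unpacking(result):
    # Requires the value to yield exactly two items.
    first, _ = result
    return first


pair = ("codes", "uniques")
triple = ("codes", "uniques", "extra")

from_index = first_by_index(pair)
from_unpacking = first_by_unpacking(pair)
from_index_triple = first_by_index(triple)

# Unpacking three items into two names raises ValueError.
try:
    first_by_unpacking(triple)
    unpacking_failed = False
except ValueError:
    unpacking_failed = True
```

Both forms agree on a two-item result. Only the unpacking form fails when the number of items changes, which can be useful as an implicit check on the shape of the return value.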

Upvotes: 8
