Reputation: 1536
I'm trying to work through an example script on machine learning, "Common pitfalls in interpretation of coefficients of linear models", but I'm having trouble understanding some of the steps. The beginning of the script looks like this:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
# We identify features `X` and targets `y`: the column WAGE is our
# target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")
X.head()
# Our target for prediction is the wage.
y = survey.target.values.ravel()
survey.target.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
My problem is with these lines:
y = survey.target.values.ravel()
survey.target.head()
If we examine survey.target.head() immediately after these lines, the output is:
Out[36]:
0 5.10
1 4.95
2 6.67
3 4.00
4 7.50
Name: WAGE, dtype: float64
How does the model know that WAGE is the target variable? Does it not have to be explicitly declared?
Upvotes: 1
Views: 75
Reputation: 46908
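The line survey.target.values.ravel() is meant to flatten the array, but in this example it is not necessary. survey.target is a pandas Series (a one-dimensional labeled array, like a single DataFrame column) and survey.target.values is the underlying numpy array, which is already one-dimensional. You can pass either one to train_test_split, since survey.target holds only one column. As for how WAGE gets picked: fetch_openml has a target_column parameter that defaults to 'default-target', so the target is whichever column the dataset's OpenML metadata marks as the default target, and the script does not need to declare it.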
type(survey.target)
pandas.core.series.Series
type(survey.target.values)
numpy.ndarray
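To see why ravel() changes nothing here, a minimal sketch with a small made-up Series standing in for survey.target. The wage values are copied from the head() output in the question; the names target, values, flat and column are only for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for survey.target: values taken from the head() output in the question.
target = pd.Series([5.10, 4.95, 6.67, 4.00, 7.50], name="WAGE")

values = target.values   # 1-D numpy array
flat = values.ravel()    # already 1-D, so ravel() leaves it unchanged

print(values.shape, flat.shape)
print(np.array_equal(values, flat))

# ravel() only makes a difference when the target arrives as a 2-D,
# single-column array, for example from a one-column DataFrame.
column = target.to_frame().values
print(column.shape, column.ravel().shape)
```

So ravel() is a safeguard for the 2-D case; with a Series target it can be dropped.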
If you use survey.target directly, the pair plot with regression fits still works:
y = survey.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
With another dataset, for example iris, suppose you want to regress petal width on the other measurements. Here you choose the predictor and response columns of the DataFrame explicitly using square brackets []:
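from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Load iris as a single DataFrame (feature columns plus the species target).
dat = load_iris(as_frame=True).frame
# Predictors: a list inside the brackets selects several columns.
X = dat[['sepal length (cm)','sepal width (cm)','petal length (cm)']]
# Response: petal width, kept as a one-column DataFrame.
y = dat[['petal width (cm)']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
LR = LinearRegression()
LR.fit(X_train, y_train)
# Compare observed and predicted petal width on the test set.
plt.scatter(x=y_test, y=LR.predict(X_test))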
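One detail of the bracket selection is worth checking for yourself: double brackets return a one-column DataFrame, single brackets return a Series. The sketch below is only an illustration (the names y_frame, y_series, pred_frame and pred_series are made up here); it fits the same model both ways and compares the shape of the predictions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dat = load_iris(as_frame=True).frame
X = dat[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]]

y_frame = dat[["petal width (cm)"]]   # double brackets: DataFrame, shape (150, 1)
y_series = dat["petal width (cm)"]    # single brackets: Series, shape (150,)

# Split everything together so both targets share the same train/test rows.
X_train, X_test, yf_train, yf_test, ys_train, ys_test = train_test_split(
    X, y_frame, y_series, random_state=42
)

pred_frame = LinearRegression().fit(X_train, yf_train).predict(X_test)
pred_series = LinearRegression().fit(X_train, ys_train).predict(X_test)

print(pred_frame.shape)   # 2-D: one column per target column
print(pred_series.shape)  # 1-D
```

Both fits give the same numbers; only the shape of the output differs, which is the situation where a ravel() call becomes useful.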
Upvotes: 1