taga

Reputation: 3885

What features to use for regression or classification?

Is there a way to determine which features are the most relevant for my machine learning model? If I have 20 features, is there a function that will decide which features I should use (or a function that will automatically remove features that are not relevant)? I plan to do this for a regression or classification model.

My desired output is a list of the most relevant features, and a prediction.

import pandas as pd
from sklearn.linear_model import LinearRegression

dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]  # feature columns
results = df.iloc[:,-1]     # target column (outcome)

print(variables.shape)
print(results.shape)


reg = LinearRegression()
reg.fit(variables, results)

x = reg.predict([[18, 2, 21]])[0]  # predict the outcome for a new sample
print(x)

Upvotes: 0

Views: 349

Answers (4)

The Mask

Reputation: 579

Well, initially I faced the same problem. The two methods that I find useful for selecting relevant features are these.

1. You can get the importance of each feature in your dataset from the feature importance property of the model. Feature importance is built into tree-based classifiers.

import pandas as pd
import numpy as np
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()


2. Correlation Matrix with Heatmap

Correlation states how the features are related to each other and to the target variable, which gives an intuition of how strongly each feature is correlated with the outcome you want to predict.

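A minimal sketch of such a heatmap with seaborn, using the question's toy data (the colormap and annotation settings are just illustrative choices):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data from the question; any DataFrame with a numeric target works the same way.
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)

# Compute the pairwise correlation matrix and draw it as an annotated heatmap.
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="RdYlGn")
plt.show()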

This is not my own research; it comes from a blog post on feature selection that helped clear my doubts, and I'm sure it will do the same for yours. :)

Upvotes: 0

Jonathan Guymont

Reputation: 497

When using a linear model, it is important to use linearly independent features. You can visualize the correlations with df.corr():

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

np.random.seed(2)

dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

print(df.corr())
out:
            par_1     par_2     par_3   outcome
par_1    1.000000  0.977935  0.191422  0.913878
par_2    0.977935  1.000000  0.193213  0.919307
par_3    0.191422  0.193213  1.000000 -0.158170
outcome  0.913878  0.919307 -0.158170  1.000000

You can see that par_1 and par_2 are strongly correlated. As @taga mentioned, you can use PCA to map your features to a lower dimensional space where they are linearly independent:

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

pca = PCA(n_components=2)
pca_all = pca.fit_transform(variables)

print(np.corrcoef(pca_all[:, 0], pca_all[:, 1]))
out:
[[1.00000000e+00 1.87242048e-16]
 [1.87242048e-16 1.00000000e+00]]

Remember to validate your model on out-of-sample data:

X_train = variables[:4]
y_train = results[:4]
X_valid = variables[4:]
y_valid = results[4:]

pca = PCA(n_components=2)
pca.fit(X_train)

pca_train = pca.transform(X_train)
pca_valid = pca.transform(X_valid)
print(pca_train)

reg = LinearRegression()
reg.fit(pca_train, y_train)

yhat_train = reg.predict(pca_train)
yhat_valid = reg.predict(pca_valid)

print(mean_squared_error(y_train, yhat_train))  # training error
print(mean_squared_error(y_valid, yhat_valid))  # validation error

Feature selection is not trivial: there are a lot of sklearn modules that achieve it (see the docs), and you should always try at least a couple of them and see which one increases performance on out-of-sample data.
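For example, a minimal, illustrative sketch of one of those modules, recursive feature elimination (RFE) with a linear model on the question's toy data (n_features_to_select=2 is an arbitrary choice here):

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy data from the question.
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Recursively drop the weakest feature until only the requested number remains.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)
print(X.columns[rfe.support_])  # names of the selected features
print(rfe.ranking_)             # rank 1 means the feature was kept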

Upvotes: 1

Charles Landau

Reputation: 4265

You can access the coef_ attribute of your reg object:

print(reg.coef_)

It's an oversimplification to call these weights, as they have a specific meaning in linear regression, but they are the values you have to work with.
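A small follow-up sketch (assuming reg and variables from the question's code have already been defined) that pairs each coefficient with its column name, so the output reads like the list of values the question asks for:

import pandas as pd

# Assumes `reg` has been fit on `variables` as in the question's code.
coefs = pd.Series(reg.coef_, index=variables.columns)
print(coefs)                                     # coefficient per feature
print(coefs.abs().sort_values(ascending=False))  # ordered by magnitude
# Note: magnitudes are only comparable if the features are on similar scales.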

Upvotes: 1

neko

Reputation: 389

The term you are looking for is feature selection: it consists of identifying which features are the most relevant for your analysis. The scikit-learn library has a whole section dedicated to it here.
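For example, a minimal sketch with SelectKBest from that section, on the question's toy data (f_regression as the scoring function and k=2 are just illustrative choices):

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data from the question.
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Score each feature against the target and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=2)
X_new = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])  # names of the selected features
print(selector.scores_)                   # F-scores for all features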

Another possibility is to resort to dimensionality reduction techniques, like PCA (Principal Component Analysis) or Random Projections. Each technique has its pros and cons, so much depends on the data you have and the specific application.

Upvotes: 1
