Reputation: 3885
Is there a way to determine which features are the most relevant for my machine learning model? If I have 20 features, is there a function that will decide which features I should use (or a function that will automatically remove the features that are not relevant)? I plan to do this for a regression or classification model.
My desired output is a list of the most relevant features, and a prediction.
import pandas as pd
from sklearn.linear_model import LinearRegression
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
'par_2': [1, 3, 1, 2, 3, 3, 2],
'par_3': [15, 3, 16, 65, 24, 56, 13],
'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
variables = df.iloc[:,:-1]
results = df.iloc[:,-1]
print(variables.shape)
print(results.shape)
reg = LinearRegression()
reg.fit(variables, results)
x = reg.predict([[18, 2, 21]])[0]
print(x)
Upvotes: 0
Views: 349
Reputation: 579
Well, initially I faced the same problem. The two methods that I find useful for selecting relevant features are these.
1. You can get the importance of each feature of your dataset by using the feature importance property of the model. feature_importances_ is a built-in attribute of tree-based classifiers.
import pandas as pd
import numpy as np
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) # use the feature_importances_ attribute of tree-based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
2. Correlation Matrix with Heatmap
Correlation states how the features are related to each other and to the target variable. It gives an intuition of how strongly each feature is correlated with the target, and a heatmap makes this easy to read at a glance (see the sketch below).
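A minimal sketch (my addition, not from the original post, assuming the data DataFrame loaded above and that seaborn is available):
import seaborn as sns
import matplotlib.pyplot as plt
corr = data.corr()  # pairwise correlations, including the target column
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='RdYlGn')  # annotate each cell with its correlation value
plt.show()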
This is not my own research; it comes from this blog post on feature selection, which helped clear my doubts and I'm sure will do the same for yours. :)
Upvotes: 0
Reputation: 497
When using a linear model, it is important to use linearly independent features. You can inspect the correlations with df.corr():
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
np.random.seed(2)
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
'par_2': [1, 3, 1, 2, 3, 3, 2],
'par_3': [15, 3, 16, 65, 24, 56, 13],
'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
print(df.corr())
out:
par_1 par_2 par_3 outcome
par_1 1.000000 0.977935 0.191422 0.913878
par_2 0.977935 1.000000 0.193213 0.919307
par_3 0.191422 0.193213 1.000000 -0.158170
outcome 0.913878 0.919307 -0.158170 1.000000
You can see that par_1 and par_2 are strongly correlated. As @taga mentioned, you can use PCA to map your features to a lower-dimensional space in which they are uncorrelated:
variables = df.iloc[:,:-1]
results = df.iloc[:,-1]
pca = PCA(n_components=2)
pca_all = pca.fit_transform(variables)
print(np.corrcoef(pca_all[:, 0], pca_all[:, 1]))
out:
[[1.00000000e+00 1.87242048e-16]
[1.87242048e-16 1.00000000e+00]]
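As a side note (my addition, not part of the original answer), pca.explained_variance_ratio_ tells you how much of the total variance each component captures, which helps when choosing n_components:
print(pca.explained_variance_ratio_)  # fraction of the variance explained by each component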
Remember to validate your model on out-of-sample data:
X_train = variables[:4]
y_train = results[:4]
X_valid = variables[4:]
y_valid = results[4:]
pca = PCA(n_components=2)
pca.fit(X_train)
pca_train = pca.transform(X_train)
pca_valid = pca.transform(X_valid)
print(pca_train)
reg = LinearRegression()
reg.fit(pca_train, y_train)
yhat_train = reg.predict(pca_train)
yhat_valid = reg.predict(pca_valid)
print(mean_squared_error(yhat_train, y_train))
print(mean_squared_error(yhat_valid, y_valid))
Feature selection is not trivial: there are a lot of sklearn modules that achieve it (see the docs), and you should always try at least a couple of them and see which one increases performance on out-of-sample data.
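For instance, a minimal sketch (my addition, reusing variables and results from above) with SelectKBest and a univariate F-test:
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(score_func=f_regression, k=2)  # keep the 2 features with the highest F-score
X_selected = selector.fit_transform(variables, results)
print(selector.scores_)  # univariate F-score of each feature
print(variables.columns[selector.get_support()])  # names of the selected features
Other selectors (RFE, SelectFromModel, ...) follow the same fit/transform interface, so they are easy to swap in and compare.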
Upvotes: 1
Reputation: 4265
You can access the coef_ attribute of your reg object:
print(reg.coef_)
It's an oversimplification to call these weights, as they have a specific meaning in linear regression, but they're what you have.
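To relate each coefficient back to a feature name, something like this should work (a sketch, assuming variables is the DataFrame from the question):
import pandas as pd
coefs = pd.Series(reg.coef_, index=variables.columns)  # one coefficient per input column
print(coefs)
Keep in mind that raw coefficients are only comparable as importances if the features are on the same scale (e.g. after standardization).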
Upvotes: 1
Reputation: 389
The term you are looking for is feature selection: it consists of identifying which features are the most relevant for your analysis. The scikit-learn library has a whole section dedicated to it here.
Another possibility is to resort to dimensionality reduction techniques, like PCA (Principal Component Analysis) or Random Projections. Each technique has its pros and cons, so much depends on the data you have and the specific application.
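As a rough sketch of the feature-selection route (my own example, reusing variables and results from the question), recursive feature elimination wraps an estimator and drops the weakest features one at a time:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)  # eliminate features until 2 remain
rfe.fit(variables, results)
print(variables.columns[rfe.support_])  # the features RFE kept
print(rfe.ranking_)  # 1 = selected; higher numbers were eliminated earlier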
Upvotes: 1