Reputation: 3580
I'm working on a multivariable regression from a CSV file, predicting crop performance from multiple factors. Some of my columns are numerical and meaningful as numbers. Others are categorical, stored either as numbers or as strings (for instance, crop variety or plot code). How do I get Python to use them in the regression? I've found OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but don't really understand how to apply it here.
My code so far:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('filepath.csv')
# Drop rows where the target column is missing
df.dropna(subset=['LabeledDataColumn'], inplace=True)
scale = StandardScaler()
# Take an explicit copy so the scaled assignment below does not raise SettingWithCopyWarning
X = df[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].copy()
y = df['LabeledDataColumn']
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']] = scale.fit_transform(X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].to_numpy())
#print (X)
est = sm.OLS(y, X).fit()
est.summary()
Upvotes: 2
Views: 308
Reputation: 2253
You could use the get_dummies function that pandas provides to convert each categorical column into a set of indicator (dummy) columns.
Something like this:
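# dtype=float keeps the dummy columns numeric (recent pandas versions return booleans by default)
predictor = pd.concat([data.get(['numerical_column_1','numerical_column_2','label']),
                       pd.get_dummies(data['categorical_column1'], prefix='Categorical_col1', dtype=float),
                       pd.get_dummies(data['categorical_column2'], prefix='categorical_col2', dtype=float)],
                      axis=1)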
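To see what get_dummies does to a single column, here is a tiny sketch on made-up data. The frame `toy` and the column name `variety` are hypothetical and only there for illustration.

```python
import pandas as pd

# Made-up data; 'variety' stands in for any categorical column.
toy = pd.DataFrame({'variety': ['A', 'B', 'A', 'C']})

# One indicator column per distinct category, named <prefix>_<category>.
# dtype=float asks for 0.0/1.0 values instead of booleans.
dummies = pd.get_dummies(toy['variety'], prefix='variety', dtype=float)
print(dummies)
```

Each row has a 1.0 in the column matching its category and 0.0 elsewhere, so the result can sit alongside the numerical columns.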
Then you can pull the outcome/label column out of predictor with:
outcome = predictor['label']
del predictor['label']
Then fit the model on the data:
est = sm.OLS(outcome, predictor).fit()
Upvotes: 1