Boris K

Reputation: 3580

Python scikit-learn and pandas categorical data

I'm working on a multivariable regression from a CSV, predicting crop performance from multiple factors. Some of my columns are numerical and meaningful as numbers. Others are categorical, stored either as numeric codes or as strings (for instance, crop variety or plot code). How do I get Python to use them? I've found OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but don't really understand how to apply it here.

My code so far:

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('filepath.csv')

# Drop rows where the label is missing
df = df.dropna(subset=['LabeledDataColumn'])

scale = StandardScaler()

pd.options.mode.chained_assignment = None  # default='warn'
X = df[['inputColumn1', 'inputColumn2', ...,'inputColumn20']]
y = df['LabeledDataColumn']

# .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']] = scale.fit_transform(X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].to_numpy())

#print (X)

est = sm.OLS(y, X).fit()

est.summary()

Upvotes: 2

Views: 308

Answers (1)

Gayatri

Reputation: 2253

You could use the get_dummies function that pandas provides to convert each categorical column into indicator (dummy) columns.

Something like this:

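# Keep the numerical columns and the label, then append one indicator column per category level.
# dtype=float keeps the dummies numeric (newer pandas versions return booleans by default).
predictor = pd.concat([data.get(['numerical_column_1', 'numerical_column_2', 'label']),
                       pd.get_dummies(data['categorical_column1'], prefix='Categorical_col1', dtype=float),
                       pd.get_dummies(data['categorical_column2'], prefix='categorical_col2', dtype=float)],
                      axis=1)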

Then you can pull the outcome/label column out of the frame with:

outcome = predictor['label']
del predictor['label']

Then fit the model on the data with:

est = sm.OLS(outcome, predictor).fit()
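
For the case in the question where a categorical column is stored as numbers (a plot code, say), here is a minimal sketch of the encoding step on its own. The toy frame and its column names (rainfall, variety, plot_code, label) are made up for illustration and are not from the question's CSV. Listing a column in `columns=` makes get_dummies encode it even when its dtype is numeric. `drop_first=True` is an optional choice that drops one level per category so the indicators are not collinear with each other or with an intercept.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
data = pd.DataFrame({
    "rainfall": [310.0, 295.0, 402.0, 388.0, 350.0, 365.0],  # genuinely numeric
    "variety": ["A", "B", "A", "C", "B", "C"],               # string categorical
    "plot_code": [1, 2, 1, 2, 1, 2],                         # numeric code, still categorical
    "label": [4.1, 3.8, 5.0, 4.6, 4.2, 4.4],
})

outcome = data["label"]

# Encode only the listed columns; numeric columns not listed pass through unchanged.
predictors = pd.get_dummies(
    data.drop(columns="label"),
    columns=["variety", "plot_code"],  # encoded regardless of dtype
    drop_first=True,                   # drop one reference level per category
    dtype=float,                       # plain 0.0/1.0 columns instead of booleans
)

print(predictors.columns.tolist())
# ['rainfall', 'variety_B', 'variety_C', 'plot_code_2']
```

The resulting predictors frame is all-numeric, so it can be passed to sm.OLS(outcome, predictors) the same way as above. If you drop a level per category like this, you would normally also add an intercept column (statsmodels has sm.add_constant for that).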

Upvotes: 1
