Uttara

Reputation: 125

Treating data as categorical in linear regression

I have data in a csv file that looks somewhat like this:

column1    column2
   b          2
   c          4
   z          1
   g          3
...

(This is not the real data.) column1 is categorical and column2 is continuous, and I want to carry out linear regression on this data. My code looks like this at the moment:

import pandas as pd
from sklearn import linear_model


# Function to get data from the csv file.
def import_data(file_name):
    df = pd.read_csv(file_name).drop_duplicates()
    X_parameter = []
    Y_parameter = []
    for alpha, beta in zip(df['column1'], df['column2']):
        X_parameter.append([float(alpha)])
        Y_parameter.append(float(beta))
    return X_parameter, Y_parameter


X, Y = import_data(filename)


def linear_model_main(X_parameters, Y_parameters, predict_value):
    # Create linear regression object
    regress = linear_model.LinearRegression()
    regress.fit(X_parameters, Y_parameters)
    prediction_outcome = regress.predict(predict_value)
    predictions = {}
    predictions['intercept'] = regress.intercept_
    predictions['coefficient'] = regress.coef_
    predictions['predicted_value'] = prediction_outcome
    return predictions

How do I specify in this code that column1 is categorical? I tried mapping it to numbers (a = 1, b = 2, ...), but the model then treats it as a continuous variable.

Upvotes: 0

Views: 2232

Answers (1)

Alexander

Reputation: 109520

You can use pd.get_dummies to convert the categorical column into dummy (indicator) variables, one 0/1 column per category:

>>> pd.concat([df, pd.get_dummies(df.column1)], axis=1)
  column1  column2  b  c  g  z
0       b        2  1  0  0  0
1       c        4  0  1  0  0
2       z        1  0  0  0  1
3       g        3  0  0  1  0
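
To see how this differs from the a = 1, b = 2, ... coding tried in the question, here is a minimal sketch that fits both encodings on the toy data. The DataFrame is rebuilt by hand from the sample rows, and the letter-to-number mapping via ord is only an assumption about how the integer coding was done. Passing dtype=int to get_dummies keeps the 0/1 values on recent pandas versions, where the default is boolean.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows copied from the question (not the real data).
df = pd.DataFrame({"column1": ["b", "c", "z", "g"], "column2": [2, 4, 1, 3]})

# Integer coding (a = 1, b = 2, ...): a single numeric feature, so the
# model fits one slope across all letters and treats them as ordered.
codes = df["column1"].map(lambda ch: ord(ch) - ord("a") + 1).to_frame("code")
as_number = LinearRegression().fit(codes, df["column2"])

# Dummy coding: one 0/1 feature per category, so each category gets
# its own coefficient and no ordering is implied.
dummies = pd.get_dummies(df["column1"], dtype=int)
as_category = LinearRegression().fit(dummies, df["column2"])

print(as_number.coef_.shape)    # number of coefficients with integer coding
print(as_category.coef_.shape)  # number of coefficients with dummy coding
```

With integer coding the model has a single coefficient for the whole column, which is what "treating it as continuous" means here; with dummy coding it has one coefficient per category.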

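EDIT: To fit the model, assign the concatenated frame back to df, drop the original column, and move the target to the last position:

df = pd.concat([df, pd.get_dummies(df.column1)], axis=1)
del df['column1']
df = df[['b', 'c', 'g', 'z', 'column2']]
>>> df
   b  c  g  z  column2
0  1  0  0  0        2
1  0  1  0  0        4
2  0  0  0  1        1
3  0  0  1  0        3

All columns except the last are now the features, and the last column is the target:

regress.fit(df.iloc[:, :-1].values, df.iloc[:, -1].values)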
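
Putting the pieces together, a self-contained sketch of the whole flow might look like the following. It rebuilds the toy data by hand instead of reading the csv file, and the encode helper is hypothetical (it is not part of pandas or scikit-learn). The helper is there because a new value has to be encoded with the same dummy columns, in the same order, as the training data before calling predict.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows copied from the question (not the real data).
df = pd.DataFrame({"column1": ["b", "c", "z", "g"], "column2": [2, 4, 1, 3]})

# One indicator column per category; dtype=int keeps 0/1 values on
# recent pandas versions, where the default dtype is boolean.
X = pd.get_dummies(df["column1"], dtype=int)
y = df["column2"]

regress = LinearRegression()
regress.fit(X, y)


def encode(categories, columns):
    """Hypothetical helper: one-hot encode new values using the training columns."""
    dummies = pd.get_dummies(pd.Series(categories), dtype=int)
    # Training categories absent from the new data become all-zero columns;
    # categories never seen in training are dropped.
    return dummies.reindex(columns=columns, fill_value=0)


new_X = encode(["c"], X.columns)
predicted = regress.predict(new_X)
print(predicted)
```

Reindexing against X.columns is one simple way to keep the layout consistent. For larger projects, scikit-learn's OneHotEncoder serves the same purpose and remembers the categories it saw during fitting.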

Upvotes: 3
