Reputation: 125
I have data in a CSV file that looks somewhat like this:
column1 column2
b 2
c 4
z 1
g 3
...
(This is not the real data.) column1 is categorical and column2 is continuous, and I want to carry out linear regression on this data. My code looks like this at the moment:
import pandas as pd
from sklearn import linear_model

# Function to get data from the csv file.
def import_data(file_name):
    df = pd.read_csv(file_name).drop_duplicates()
    X_parameter = []
    Y_parameter = []
    for alpha, beta in zip(df['column1'], df['column2']):
        X_parameter.append([float(alpha)])
        Y_parameter.append(float(beta))
    return X_parameter, Y_parameter

X, Y = import_data(filename)

def linear_model_main(X_parameters, Y_parameters, predict_value):
    # Create linear regression object
    regress = linear_model.LinearRegression()
    regress.fit(X_parameters, Y_parameters)
    prediction_outcome = regress.predict(predict_value)
    predictions = {}
    predictions['intercept'] = regress.intercept_
    predictions['coefficient'] = regress.coef_
    predictions['predicted_value'] = prediction_outcome
    return predictions
How do I specify in this code that column1 is categorical? I tried mapping the categories to numbers (a = 1, b = 2, ...), but the model then treats the column as continuous.
Upvotes: 0
Views: 2232
Reputation: 109520
You can use pd.get_dummies to convert the categorical column into dummy (indicator) variables:
>>> pd.concat([df, pd.get_dummies(df.column1)], axis=1)
column1 column2 b c g z
0 b 2 1 0 0 0
1 c 4 0 1 0 0
2 z 1 0 0 0 1
3 g 3 0 0 1 0
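As a self-contained sketch of the call above: the snippet rebuilds the four sample rows in memory (the question's real CSV is not shown, so these are only the illustrative values) and passes dtype=int, an addition not in the original answer, because recent pandas versions return boolean indicator columns by default.

```python
import pandas as pd

# Sample rows from the question (illustrative values, not the real data).
df = pd.DataFrame({"column1": ["b", "c", "z", "g"], "column2": [2, 4, 1, 3]})

# One indicator column per category; dtype=int keeps them as 0/1.
dummies = pd.get_dummies(df["column1"], dtype=int)

# Place the indicator columns next to the original frame.
encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```

The indicator columns are named after the category values and sorted alphabetically, which is why they appear as b, c, g, z rather than in row order.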
EDIT:
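# Keep the concatenated frame, then drop the original categorical column.
df = pd.concat([df, pd.get_dummies(df.column1)], axis=1)
del df['column1']
df = df[['b', 'c', 'g', 'z', 'column2']]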
>>> df
b c g z column2
0 1 0 0 0 2
1 0 1 0 0 4
2 0 0 0 1 1
3 0 0 1 0 3
regress.fit(df.iloc[:, :-1].values, df.iloc[:, -1].values)
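Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes the same four illustrative rows as the question, and the names new and new_X are hypothetical. The reindex step is one possible way to make sure a new observation is encoded with the same columns the model was fitted on.

```python
import pandas as pd
from sklearn import linear_model

# Sample rows from the question (illustrative values, not the real data).
df = pd.DataFrame({"column1": ["b", "c", "z", "g"], "column2": [2, 4, 1, 3]})

# Encode the categorical predictor as 0/1 indicator columns.
X = pd.get_dummies(df["column1"], dtype=int)
y = df["column2"]

regress = linear_model.LinearRegression()
regress.fit(X.values, y.values)

# Encode a new observation with exactly the same columns as X;
# categories absent from the new data are filled with 0.
new = pd.DataFrame({"column1": ["g"]})
new_X = pd.get_dummies(new["column1"], dtype=int).reindex(
    columns=X.columns, fill_value=0
)
prediction = regress.predict(new_X.values)
print(prediction)
```

With an intercept in the model, the full set of indicator columns is collinear. The fit still succeeds, but if interpretable coefficients matter, get_dummies accepts drop_first=True to drop one level and treat it as the baseline.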
Upvotes: 3