0x3d

Reputation: 470

scikit-learn LinearRegression: string predicted value

After working through some courses and tutorial examples, I am trying to build my first machine learning model. I got the training data from here: https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv and I'm using pandas to load the CSV.

The main problem is that the column to predict contains strings, while the algorithms work with floats.

Sure, I could manually map all the strings to numbers (0, 1, 2) and use the modified file, but I am trying to find a way to replace the string values automatically using pandas or scikit-learn and keep the mapping in a separate array.

My code is:

import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression

data = pd.read_csv("https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv")

data.head()

features_cols = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
X = data[features_cols]
y = data.Name

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
linreg = LinearRegression()
linreg.fit(X_train, y_train)

The error I see is:

ValueError: could not convert string to float: 'Iris-setosa'

How can I use pandas to replace all the values in the "Name" column with integers?

Upvotes: 0

Views: 2460

Answers (2)

Tonechas

Reputation: 13733

I recommend importing the iris dataset directly from scikit-learn, where the target is already encoded as integers:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

Demo:

In [9]: from sklearn.cross_validation import train_test_split

In [10]: from sklearn.linear_model import LinearRegression

In [11]: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [12]: linreg = LinearRegression()

In [13]: linreg.fit(X_train, y_train)
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [14]: linreg.score(X_test, y_test)
Out[14]: 0.89946565707178838

In [15]: y
Out[15]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
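
Since the question also asks for the mapping to be kept in a separate array, here is a minimal sketch of how the integer codes could be turned back into species names. It assumes the `target_names` attribute of the object returned by `load_iris()`; the variable name `y_names` is only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()
y = iris.target                    # integer codes 0, 1, 2
target_names = iris.target_names   # array of species names, indexed by code

# Index the names array with the codes to recover one label per sample.
y_names = target_names[y]

print(target_names)
print(y_names[:3], y_names[-3:])
```

Note that the names in this bundled dataset are 'setosa', 'versicolor' and 'virginica', without the 'Iris-' prefix used in the CSV file.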

Upvotes: 0

Scratch'N'Purr

Reputation: 10399

You can use scikit-learn's LabelEncoder:

>>> import pandas as pd
>>> from sklearn import preprocessing
>>> df = pd.DataFrame({'Name':['Iris-setosa','Iris-setosa','Iris-versicolor','Iris-virginica','Iris-setosa','Iris-versicolor'], 'a': [1,2,3,4,1,1]})
>>> y = df.Name
>>> le = preprocessing.LabelEncoder()
>>> le.fit(y)  # fit your y array
LabelEncoder()
>>> le.classes_  # check your unique classes
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
>>> y_transformed = le.transform(y)  # transform your y with numeric encodings
>>> y_transformed
array([0, 0, 1, 2, 0, 1], dtype=int64)
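
As a rough sketch of the round trip, the encoder can also decode the integers back to the original strings with `inverse_transform`, and `pandas.factorize` is a pandas-only alternative that returns the codes together with the array of unique labels. The small DataFrame below is made-up sample data standing in for the CSV, not the real file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample standing in for the "Name" column of iris.csv.
df = pd.DataFrame(
    {"Name": ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]}
)

# LabelEncoder sorts the classes, so codes follow alphabetical order.
le = LabelEncoder()
y_encoded = le.fit_transform(df["Name"])
y_decoded = le.inverse_transform(y_encoded)  # back to the original strings

# pandas.factorize assigns codes in order of first appearance instead.
codes, uniques = pd.factorize(df["Name"])

print(y_encoded, list(le.classes_))
print(codes, list(uniques))
```

The two approaches can assign different integers to the same label (sorted order versus order of first appearance), so whichever mapping is used, `le.classes_` or `uniques`, should be stored alongside the model.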

Upvotes: 1
