Reputation: 470
After working through some courses and tutorial examples, I am trying to build my first machine learning model. I got the training data from https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv and I'm using pandas to load the CSV.
The main problem is that the column I want to predict contains strings, while the algorithms work with floats.
Of course I could manually map the strings to numbers (0, 1, 2) and use the modified file, but I'd like a way to replace the string values automatically with pandas or scikit-learn and keep the mapping in a separate array.
My code is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv("https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv")
data.head()
features_cols = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
X = data[features_cols]
y = data.Name
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
The error I see is:
ValueError: could not convert string to float: 'Iris-setosa'
How can I use pandas to replace all the values in the "Name" column with integers?
Upvotes: 0
Views: 2460
Reputation: 13733
I recommend loading the iris dataset directly from scikit-learn, where the target is already encoded as integers:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
Demo:
In [9]: from sklearn.model_selection import train_test_split
In [10]: from sklearn.linear_model import LinearRegression
In [11]: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [12]: linreg = LinearRegression()
In [13]: linreg.fit(X_train, y_train)
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [14]: linreg.score(X_test, y_test)
Out[14]: 0.89946565707178838
In [15]: y
Out[15]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
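If you also need the species names behind these integer codes, the loaded dataset carries them in iris.target_names. A minimal sketch, assuming scikit-learn's bundled copy of the data (the variable name species is only illustrative):

```python
from sklearn import datasets

iris = datasets.load_iris()

# target_names[i] is the species name that iris.target encodes as integer i
print(iris.target_names)

# Index the names array with the integer codes to recover one name per row
species = iris.target_names[iris.target]
print(species[:3])
```

Note that the bundled names are 'setosa', 'versicolor' and 'virginica', without the 'Iris-' prefix used in the CSV file.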
Upvotes: 0
Reputation: 10399
You can use scikit-learn's LabelEncoder:
>>> import pandas as pd
>>> from sklearn import preprocessing
>>> df = pd.DataFrame({'Name':['Iris-setosa','Iris-setosa','Iris-versicolor','Iris-virginica','Iris-setosa','Iris-versicolor'], 'a': [1,2,3,4,1,1]})
>>> y = df.Name
>>> le = preprocessing.LabelEncoder()
>>> le.fit(y) # fit your y array
LabelEncoder()
>>> le.classes_ # check your unique classes
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
>>> y_transformed = le.transform(y) # transform your y with numeric encodings
>>> y_transformed
array([0, 0, 1, 2, 0, 1], dtype=int64)
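For the "keep the mapping in a separate array" part of the question, a pandas-only alternative is pd.factorize, which returns the integer codes together with the unique labels. This is a small sketch on the same toy data, not a replacement for the approach above; codes, uniques and decoded are illustrative names, and sort=True is my choice so that the numbering follows the same alphabetical order as LabelEncoder:

```python
import pandas as pd

names = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
     "Iris-virginica", "Iris-setosa", "Iris-versicolor"],
    name="Name",
)

# codes: one integer per row; uniques: the lookup of original labels.
# sort=True numbers the labels in sorted order instead of order of appearance.
codes, uniques = pd.factorize(names, sort=True)

# Indexing the lookup with the codes maps the integers back to strings
decoded = uniques[codes]

print(codes)
print(list(uniques))
```

With the LabelEncoder shown above, le.inverse_transform(y_transformed) performs the same reverse lookup.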
Upvotes: 1