Reputation: 222461
I have the following numpy matrix:
M = [
['a', 5, 0.2, ''],
['a', 2, 1.3, 'as'],
['b', 1, 2.3, 'as'],
]
M = np.array(M)
I would like to encode the categorical values ('a', 'b', '', 'as'). I tried to encode them using OneHotEncoder. The problem is that it does not work with string variables and raises an error.
enc = preprocessing.OneHotEncoder()
enc.fit(M)
enc.transform(M).toarray()
I know that I have to use categorical_features to indicate which features I am going to encode, and I thought that by providing dtype I would be able to handle string values, but I cannot. So is there a way to encode the categorical values in my matrix?
Upvotes: 7
Views: 8817
Reputation: 5896
You can use DictVectorizer:
from sklearn.feature_extraction import DictVectorizer
import pandas as pd
dv = DictVectorizer(sparse=False)
df = pd.DataFrame(M).convert_objects(convert_numeric=True)
dv.fit_transform(df.to_dict(orient='records'))
array([[ 5. , 0.2, 1. , 0. , 1. , 0. ],
[ 2. , 1.3, 1. , 0. , 0. , 1. ],
[ 1. , 2.3, 0. , 1. , 0. , 1. ]])
dv.feature_names_ holds the correspondence to the columns:
[1, 2, '0=a', '0=b', '3=', '3=as']
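For newer environments, here is a minimal sketch of the same DictVectorizer idea. It assumes a pandas release where convert_objects is no longer available and Python 3, where feature names of mixed int and str types cannot be sorted. The column names c0 to c3 are hypothetical labels introduced only for this example; the numeric columns are converted explicitly with pd.to_numeric.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Same data as in the question; NumPy stores every cell as a string here.
M = np.array([
    ['a', 5, 0.2, ''],
    ['a', 2, 1.3, 'as'],
    ['b', 1, 2.3, 'as'],
])

# Hypothetical string column names keep all feature names the same type,
# so DictVectorizer can sort them.
df = pd.DataFrame(M, columns=['c0', 'c1', 'c2', 'c3'])

# Convert only the numeric columns back to numbers; c0 and c3 stay strings.
df[['c1', 'c2']] = df[['c1', 'c2']].apply(pd.to_numeric)

dv = DictVectorizer(sparse=False)
# String values are one-hot encoded, numeric values are passed through.
X = dv.fit_transform(df.to_dict(orient='records'))

print(dv.feature_names_)
print(X)
```

With this naming, each string column expands into one indicator column per distinct value (for example c0=a and c0=b), while c1 and c2 stay as single numeric columns.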
Upvotes: 17