Reputation: 401
I have a data set containing both categorical and numerical columns, and my target column is also categorical. I am using the scikit-learn library with Python 3.4. I know that scikit-learn needs all categorical values to be transformed to numerical values before doing any machine learning.
How should I transform my categorical columns to numerical values? I tried a lot of things, but I keep getting different errors, such as "'str' object has no attribute 'items'" and "'numpy.ndarray' object has no attribute 'items'".
Here is an example of my data:
UserID  LocationID  AmountPaid  ServiceID  Target
29876   IS345       23.9876     FRDG       JFD
29877   IS712       135.98      WERS       KOI
My dataset is saved in a CSV file, here is the little code I wrote to give you an idea about what I want to do:
import pandas as pd

# reading my csv file
data_dir = 'C:/Users/davtalab/Desktop/data/'
train_file = data_dir + 'train.csv'
train = pd.read_csv(train_file)

# numeric columns:
x_numeric_cols = train['AmountPaid'].values

# categorical columns (a list of names, not one concatenated string):
categorical_cols = ['UserID', 'LocationID', 'ServiceID']
x_cat_cols = train[categorical_cols].values  # .values replaces the deprecated as_matrix()

y_target = train['Target'].values
I need x_cat_cols to be converted to numeric values and then added to x_numeric_cols, giving me my complete input (x) values.
Then I need to convert my target column into numeric values as well and make that my final target (y) column.
Then I want to fit a random forest using these two complete sets:

from sklearn.ensemble import RandomForestClassifier as RF

# n_trees, max_features, verbose and n_jobs are set elsewhere in my script
rf = RF(n_estimators=n_trees, max_features=max_features, verbose=verbose, n_jobs=n_jobs)
rf.fit(x_train, y_train)
Thanks for your help!
Upvotes: 8
Views: 4732
Reputation: 401
This was because of the way I enumerated the data. If I print the data (using another sample), you will see:
>>> import pandas as pd
>>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
... 'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
>>> samples = [dict(enumerate(sample)) for sample in train]
>>> samples
[{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]
This just enumerates the column names rather than the rows. We should do this instead:
>>> train_as_dicts = [dict(r.items()) for _, r in train.iterrows()]
>>> train_as_dicts
[{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
{'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
{'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]
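As an aside, pandas can build the same list of dicts in a single call:
>>> train_as_dicts = train.to_dict(orient='records')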
Now we need to vectorize the dicts:
>>> from sklearn.feature_extraction import DictVectorizer
>>> vectorizer = DictVectorizer()
>>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
>>> vectorized_sparse
<3x7 sparse matrix of type '<type 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
>>> vectorized_array = vectorized_sparse.toarray()
>>> vectorized_array
array([[ 1., 0., 0., 1., 0., 1., 0.],
[ 0., 1., 1., 0., 1., 1., 0.],
[ 1., 0., 1., 1., 0., 0., 1.]])
To get the meaning of each column, ask the vectorizer:
>>> vectorizer.get_feature_names()
['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
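Note that DictVectorizer passes numeric values through unchanged (the 'b' column above kept its 0/1 values), so the question's numeric and categorical columns can be vectorized together. A minimal sketch along those lines, reusing the column names from the question (the n_estimators value is just a placeholder):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('train.csv')

# UserID is numeric in the CSV; cast it to str so DictVectorizer
# one-hot encodes it instead of passing it through as a number
train['UserID'] = train['UserID'].astype(str)

feature_cols = ['UserID', 'LocationID', 'AmountPaid', 'ServiceID']
rows = train[feature_cols].to_dict(orient='records')

vectorizer = DictVectorizer()
x_train = vectorizer.fit_transform(rows)  # sparse matrix; AmountPaid passes through

y_train = LabelEncoder().fit_transform(train['Target'])

rf = RandomForestClassifier(n_estimators=100)
rf.fit(x_train, y_train)  # sklearn's random forests accept sparse input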
Upvotes: 0
Reputation: 1967
For the target, you can use sklearn's LabelEncoder. This will give you a converter from string labels to numeric ones (and also a reverse mapping). Example in the link.
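A minimal sketch, using the target labels from the question's sample rows:

>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> y = le.fit_transform(['JFD', 'KOI', 'JFD'])
>>> y
array([0, 1, 0])
>>> le.inverse_transform(y)  # the reverse mapping
array(['JFD', 'KOI', 'JFD'], dtype='<U3')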
As for the features, learning algorithms in general expect (or work best with) numerical data, and simply mapping each category to an integer would impose an ordering that is not really there. So the best option is to use OneHotEncoder to convert the categorical features: it generates a new binary feature for each category, denoting on/off for that category. Again, usage example in the link.
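For example, a minimal sketch assuming scikit-learn 0.20+, where OneHotEncoder accepts string categories directly (older versions required integer-encoded input first):

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> X = [['IS345', 'FRDG'], ['IS712', 'WERS']]  # LocationID, ServiceID from the question
>>> enc.fit_transform(X).toarray()
array([[1., 0., 1., 0.],
       [0., 1., 0., 1.]])
>>> enc.categories_
[array(['IS345', 'IS712'], dtype=object), array(['FRDG', 'WERS'], dtype=object)]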
Upvotes: 4