Reputation: 209
I'm new to data analytics and am trying out some models in Python with scikit-learn. I have a dataset in which some of the columns contain text values.
Is there a way to convert these column values into numbers in pandas or scikit-learn? Is it correct to assign numbers to these values? And what happens if a new string appears in the test data?
Please advise.
Upvotes: 10
Views: 15085
Reputation: 501
I think it would be better to use OrdinalEncoder if you want to transform feature columns, because it's meant for categorical features (LabelEncoder is meant for labels). Also, it can handle values not seen in training and multiple features at the same time. An example:
from sklearn.preprocessing import OrdinalEncoder
features = ["city", "age", ...]
encoder = OrdinalEncoder(
    handle_unknown='use_encoded_value',
    unknown_value=-1
).fit(train[features])
train[features] = encoder.transform(train[features])
test[features] = encoder.transform(test[features])
More in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
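To make the unseen-value behaviour concrete, here is a minimal sketch on toy data. The column names and values are made up for illustration, and it assumes a scikit-learn version that supports handle_unknown='use_encoded_value' (0.24 or later):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "Tokyo" and "XL" appear only in the test set.
train = pd.DataFrame({"city": ["Paris", "Berlin", "Paris"],
                      "size": ["S", "L", "M"]})
test = pd.DataFrame({"city": ["Berlin", "Tokyo"],
                     "size": ["M", "XL"]})

features = ["city", "size"]

# Fit on the training data only, so the category-to-number mapping is fixed.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[features])

train_encoded = encoder.transform(train[features])
test_encoded = encoder.transform(test[features])

# Categories are sorted per column, so Berlin -> 0 and Paris -> 1;
# values never seen during fit are mapped to -1.
print(encoder.categories_)
print(test_encoded)
```

The second test row comes out as -1 in both columns, which is how the encoder flags strings it did not see during fitting.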
Upvotes: 0
Reputation: 2529
Consider using Label Encoding - it transforms the categorical data by assigning each category an integer between 0 and the num_of_categories-1:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(['a','b','c','d','a','c','a','d'], columns=['letter'])
letter
0 a
1 b
2 c
3 d
4 a
5 c
6 a
7 d
Applying:
le = LabelEncoder()
encoded_series = df.apply(le.fit_transform)
encoded_series:
letter
0 0
1 1
2 2
3 3
4 0
5 2
6 0
7 3
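One thing to keep in mind with the snippet above: a single LabelEncoder is refit for every column, so only the mapping of the last column stays available afterwards. If you need to decode the numbers later, one possible approach is to keep one encoder per column. This is only a sketch, and the second column ("colour") is invented for the example:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the "colour" column is hypothetical and only there to show
# that each column gets its own mapping.
df = pd.DataFrame({
    "letter": ["a", "b", "c", "d", "a", "c", "a", "d"],
    "colour": ["red", "blue", "red", "green", "blue", "red", "green", "blue"],
})

encoders = {}                        # column name -> fitted LabelEncoder
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    le = LabelEncoder()
    encoded[col] = le.fit_transform(df[col])
    encoders[col] = le

# classes_ holds the original values; the position in the list is the code.
mapping = {col: list(le.classes_) for col, le in encoders.items()}

# Turn the integer codes back into the original strings.
decoded_letters = encoders["letter"].inverse_transform(encoded["letter"])
```

Keeping the fitted encoders around also lets you apply the same mapping to other data with transform, although LabelEncoder raises an error on values it has not seen.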
Upvotes: 3
Reputation: 36555
You can convert them into integer codes by using the categorical datatype.
column = column.astype('category')
column_encoded = column.cat.codes
As long as you use a tree-based model with deep enough trees, e.g. GradientBoostingClassifier(max_depth=10), your model should be able to split out the categories again.
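For the "new string in test data" part of the question, a possible way to stay within pandas is to reuse the training categories when converting the test column. The sketch below uses made-up series names and values; with a fixed category list, anything not seen in training gets the code -1:

```python
import pandas as pd

# Hypothetical columns; "purple" occurs only in the test data.
train_col = pd.Series(["red", "green", "blue", "green"])
test_col = pd.Series(["blue", "purple", "red"])

# Learn the categories from the training column.
train_cat = train_col.astype("category")
train_codes = train_cat.cat.codes

# Reuse exactly those categories for the test column so the codes line up.
cat_type = pd.CategoricalDtype(categories=train_cat.cat.categories)
test_codes = test_col.astype(cat_type).cat.codes

print(train_codes.tolist())
print(test_codes.tolist())
```

Calling astype('category') separately on train and test would instead build two independent mappings, so the same string could end up with different codes in each set.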
Upvotes: 0