Selva Saravana Er

Reputation: 209

convert text columns into numbers in sklearn

I'm new to data analytics and am trying out some models in Python with scikit-learn. Some of the columns in my dataset contain text values, as shown below.

Dataset

Is there a way to convert these column values into numbers in pandas or sklearn? Is assigning numbers to these values the right approach? And what if a new string shows up in the test data?

Please advise.

Upvotes: 10

Views: 15085

Answers (3)

scepeda

Reputation: 501

I think it would be better to use OrdinalEncoder if you want to transform feature columns, because it's meant for categorical features (LabelEncoder is meant for labels). Also, it can handle values not seen in training and multiple features at the same time. An example:

from sklearn.preprocessing import OrdinalEncoder

# train and test are assumed to be pandas DataFrames that share these columns
features = ["city", "age", ...]
encoder = OrdinalEncoder(
    handle_unknown='use_encoded_value',
    unknown_value=-1,  # code assigned to categories not seen during fit
).fit(train[features])
train[features] = encoder.transform(train[features])
test[features] = encoder.transform(test[features])

More in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
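
To make the handling of unseen values concrete, here is a minimal, self-contained sketch. The toy DataFrames and the column names "city" and "color" are made up for illustration, and the handle_unknown='use_encoded_value' option assumes scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, invented for this example only.
train = pd.DataFrame({"city": ["Paris", "Berlin", "Paris"],
                      "color": ["red", "blue", "red"]})
test = pd.DataFrame({"city": ["Berlin", "Madrid"],   # "Madrid" never appears in train
                     "color": ["blue", "red"]})

features = ["city", "color"]
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[features])

train_encoded = encoder.transform(train[features])
test_encoded = encoder.transform(test[features])

# Categories are sorted per column, so Berlin=0, Paris=1 and blue=0, red=1.
# The unseen "Madrid" is mapped to unknown_value, i.e. -1.
print(test_encoded)
```

Whether -1 is a sensible code for unknown categories depends on the model that consumes it, so treat that choice as something to validate rather than a default.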

Upvotes: 0

Amir F

Reputation: 2529

Consider using Label Encoding - it transforms the categorical data by assigning each category an integer between 0 and the num_of_categories-1:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(['a','b','c','d','a','c','a','d'], columns=['letter'])

  letter
0      a
1      b
2      c
3      d
4      a
5      c
6      a
7      d

Applying:

le = LabelEncoder()
# fit_transform is applied to each column in turn; the result is a DataFrame
encoded_df = df.apply(le.fit_transform)

encoded_df:

    letter
0   0
1   1
2   2
3   3
4   0
5   2
6   0
7   3
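
One caveat regarding the question about new strings in test data: a fitted LabelEncoder raises a ValueError when asked to transform a label it did not see during fit. Also, reusing a single encoder across columns leaves it fitted only on the last column. The sketch below keeps one encoder per column instead; the "size" column and the encoders dictionary are hypothetical names added for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the "size" column is invented for this example.
df = pd.DataFrame({
    "letter": ["a", "b", "c", "d", "a", "c", "a", "d"],
    "size": ["S", "M", "S", "L", "M", "S", "L", "M"],
})

# Keep one fitted encoder per column so each mapping can be reused later.
encoders = {}
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# The stored encoder can map the integer codes back to the original strings.
original_letters = encoders["letter"].inverse_transform(encoded["letter"])

# A label that was not present during fit is rejected with a ValueError.
try:
    encoders["letter"].transform(["z"])
    unseen_raises = False
except ValueError:
    unseen_raises = True

print(encoded)
print(unseen_raises)
```

If unseen categories are expected in the test data, an encoder with explicit unknown handling is likely the safer choice.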

Upvotes: 3

maxymoo

Reputation: 36555

You can convert them into integer codes by using the categorical datatype.

column = column.astype('category')
column_encoded = column.cat.codes

As long as you use a tree-based model with deep enough trees, e.g. GradientBoostingClassifier(max_depth=10), your model should be able to split out the categories again.
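
If the same codes need to be applied to test data, one possible approach is to reuse the categories learned from the training column. This is a sketch with made-up values; train_col and test_col are hypothetical names. Values that are not among the training categories become missing and receive the code -1.

```python
import pandas as pd

# Toy columns, invented for this example only.
train_col = pd.Series(["red", "green", "blue", "green"])
test_col = pd.Series(["blue", "purple"])  # "purple" never appears in training

# Categories are inferred from the training data (sorted: blue, green, red).
train_cat = train_col.astype("category")
train_codes = train_cat.cat.codes

# Reuse the training categories so the test codes mean the same thing.
cat_dtype = pd.CategoricalDtype(categories=train_cat.cat.categories)
test_codes = test_col.astype(cat_dtype).cat.codes

print(train_codes.tolist())  # [2, 1, 0, 1]
print(test_codes.tolist())   # [0, -1]
```

Calling astype('category') separately on the test column would instead build a new mapping from the test values alone, so the integer codes would not line up with those used in training.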

Upvotes: 0
