Christoph Ehler

Reputation: 11

Convert string (object datatype) to categorical data

I have run into this issue before and hope you can help. I am trying to convert columns of 'object' dtype into categorical values.

So:

print(df_train['aus_heiz_befeuerung'].unique())

['Gas' 'Unbekannt' 'Alternativ' 'Öl' 'Elektro' 'Kohle']

The values in this column should be converted to integers, e.g. 1, 2, 4, 5, 3.

Unfortunately, I cannot figure out how.

I have tried different astype versions and the following code block:

# string label to categorical values
from sklearn.preprocessing import LabelEncoder

for i in range(df_train.shape[1]):
    if df_train.iloc[:,i].dtypes == object:
        lbl = LabelEncoder()
        lbl.fit(list(df_train.iloc[:,i].values) + list(df_test.iloc[:,i].values))
        df_train.iloc[:,i] = lbl.transform(list(df_train.iloc[:,i].values))
        df_test.iloc[:,i] = lbl.transform(list(df_test.iloc[:,i].values))

print(df_train['aus_heiz_befeuerung'].unique())

This raises: TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I would be grateful for any ideas.

Upvotes: 0

Views: 1497

Answers (1)

Liberty Lover

Reputation: 1012

You can use the pandas.Categorical() function to convert the values in a column to categorical values. For example, to convert the values in the aus_heiz_befeuerung column to categorical values, you can use the following code:

df_train['aus_heiz_befeuerung'] = pd.Categorical(df_train['aus_heiz_befeuerung'])

This changes the column's dtype to category. The values still display as strings, but pandas stores an integer code for each unique category internally. By default the categories are the sorted unique values. You can control the order in which the codes are assigned by passing a list of category names to the categories parameter of pandas.Categorical(). For example, to use the order shown in your question ('Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'), you can use the following code:

df_train['aus_heiz_befeuerung'] = pd.Categorical(df_train['aus_heiz_befeuerung'], categories=['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'])
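One thing to watch with an explicit categories list: any value that is not in the list is turned into a missing value. A minimal sketch on made-up data (the series s and the value 'Solar' are only for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical sample data; 'Solar' is deliberately not in the category list.
s = pd.Series(['Gas', 'Öl', 'Kohle', 'Gas', 'Solar'])

order = ['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle']
cat = pd.Series(pd.Categorical(s, categories=order))

print(cat.dtype)                  # the column is now of category dtype
print(list(cat.cat.categories))   # categories are kept in the order given
print(cat.isna().tolist())        # only the 'Solar' row is missing
```

So it is worth checking isna() after the conversion if the list of categories was written by hand.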

After you have converted the values in the column to categorical values, you can use the cat.codes property to access the integer values that have been assigned to each category. For example:

df_train['aus_heiz_befeuerung'].cat.codes

This will return a pandas.Series object containing the integer code assigned to each value in the aus_heiz_befeuerung column. The codes start at 0 and follow the order of the categories; missing values get the code -1.
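Since the question has both df_train and df_test, here is a rough sketch of how the same category list could be applied to both so that the codes match between them. The sample frames and the new column name aus_heiz_befeuerung_code are made up for illustration, so adapt them to your data:

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question.
df_train = pd.DataFrame({'aus_heiz_befeuerung':
                         ['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle', 'Gas']})
df_test = pd.DataFrame({'aus_heiz_befeuerung': ['Öl', 'Gas', None]})

# One shared list, so a given label gets the same code in both frames.
categories = ['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle']

for df in (df_train, df_test):
    col = pd.Categorical(df['aus_heiz_befeuerung'], categories=categories)
    # 'aus_heiz_befeuerung_code' is an assumed column name, not from the question.
    df['aus_heiz_befeuerung_code'] = col.codes

print(df_train['aus_heiz_befeuerung_code'].tolist())
print(df_test['aus_heiz_befeuerung_code'].tolist())
```

If you want the numbering to start at 1 as in the question, you can add 1 to the codes, but then missing values end up as 0 instead of -1.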

Upvotes: 1
