bgarcial

Reputation: 3193

Dealing with categorical variables - Looking for recommendations

I have the following dataset, in which the Direccion del viento (Pos) column has categorical values

[screenshot of the dataset showing the Direccion del viento (Pos) column]

In total, Direccion del viento (Pos) has 8 categories: E, N, NE, NO, O, S, SE, SO.

Then I convert this dataframe to a NumPy array and get:

direccion_viento_pos
dtype: bool
[['S']
 ['S']
 ['S']
 ...
 ['SO']
 ['NO']
 ['SO']]

Since these are string values and I need numeric ones, I have to encode the categorical variable: that is, map each text category to a numeric value.

Then I perform two activities:

  1. I use LabelEncoder() to encode the values as integers, one per category.

Label encoding is simply converting each value in a column to a number

from sklearn.preprocessing import LabelEncoder

labelencoder_direccion_viento_pos = LabelEncoder()
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])
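As a minimal, self-contained sketch of what this step produces (the sample values below are illustrative, not the full dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of wind-direction labels
values = ['S', 'S', 'SO', 'NO', 'NE']

le = LabelEncoder()
encoded = le.fit_transform(values)

# Classes are sorted alphabetically, so each label maps to its sorted index
print(list(le.classes_))  # ['NE', 'NO', 'S', 'SO']
print(list(encoded))      # [2, 2, 3, 1, 0]
```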
  2. I use OneHotEncoder to convert each category value into a new column, assigning a 1 or 0 (True/False) value to each column:

This is:

from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(categorical_features=[0])
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()

This way, I get these new values:

direccion_viento_pos
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

Then I convert this direccion_viento_pos array to a dataframe to visualize it more clearly:

# Turn array to dataframe with columns indexes
cols = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']
df_direccion_viento = pd.DataFrame(direccion_viento_pos, columns=cols)

[screenshot of the resulting one-hot encoded dataframe]

This way, each category value gets its own column, with a 1 or 0 (True/False) value in it.

If I use pandas.get_dummies() function I get the same result.
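The equivalence can be checked directly. This sketch (with illustrative sample values) compares pd.get_dummies against OneHotEncoder; note that scikit-learn 0.20+ accepts strings directly, so no LabelEncoder step is needed:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

values = ['S', 'S', 'SO', 'NO']

# pandas route: one indicator column per category, sorted alphabetically
dummies = pd.get_dummies(pd.Series(values))

# scikit-learn route: expects a 2-D array, returns a sparse matrix by default
onehot = OneHotEncoder().fit_transform(np.array(values).reshape(-1, 1)).toarray()

print(list(dummies.columns))  # ['NO', 'S', 'SO']
print(np.array_equal(dummies.to_numpy(dtype=float), onehot))  # True
```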

My question is: is this the best way of dealing with these categorical variables? Doesn't having a column for each category, with zeros in most of them, introduce bias or noise when machine learning algorithms are applied?

I've recently started reading about this in this article, but I'd appreciate any guidance.


UPDATE

I have been reading about other ways of managing these categorical variables, and I found the following:

In this Jupyter notebook of exercises (cell number 59), belonging to the book Hands-On Machine Learning with Scikit-Learn and TensorFlow, the author says the following about LabelEncoder:

Warning: earlier versions of the book used the LabelEncoder class or Pandas' Series.factorize() method to encode string categorical attributes as integers. However, the OrdinalEncoder class that is planned to be introduced in Scikit-Learn 0.20 (see PR #10521) is preferable since it is designed for input features (X instead of labels y)

This means that LabelEncoder is meant for encoding the dependent variable, not the input features. My direccion_viento categorical values are input features.

Initially, the scikit-learn 0.20 dev version had a CategoricalEncoder class. I copied this class into a categorical_encoder.py file and applied it:

from __future__ import unicode_literals
import pandas as pd

# I import the Categorical Encoder locally from my project environment
from notebooks.DireccionDelViento.sklearn.preprocessing.categorical_encoder import CategoricalEncoder

# Read the dataset
direccion_viento = pd.read_csv('Direccion del viento.csv', )

# No null values
print(direccion_viento.isnull().any())
direccion_viento.isnull().values.any()

# We select only the first  Direccion Viento (pos) column
direccion_viento = direccion_viento[['Direccion del viento (Pos)']]

encoder = CategoricalEncoder(encoding='onehot-dense', handle_unknown='ignore')
dir_viento_encoder = encoder.fit_transform(direccion_viento[['Direccion del viento (Pos)']])
print(" These are the categories", encoder.categories_)

cols = ['E', 'N', 'NE', 'NO', 'O', 'S','SE','SO']
df_direccion_viento = pd.DataFrame(dir_viento_encoder, columns=cols)

The resulting dataset is similar to the one produced by LabelEncoder plus OneHotEncoder:

[screenshot of the CategoricalEncoder result]

The difference between OneHotEncoder() and CategoricalEncoder() is that with CategoricalEncoder() it is not necessary to apply LabelEncoder() first: CategoricalEncoder can deal with strings directly, so I do not need to convert my variable values to integers beforehand.

In other words, CategoricalEncoder and OneHotEncoder ultimately produce the same result.

After reading and searching further about the CategoricalEncoder() class, I found that Aurélien Géron notes in his book that CategoricalEncoder would be deprecated in the scikit-learn 0.20 stable release.

In fact, the scikit-learn team notes on their current master branch:

CategoricalEncoder briefly existed in 0.20dev. Its functionality has been rolled into the OneHotEncoder and OrdinalEncoder.
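Since CategoricalEncoder's functionality was folded into OneHotEncoder, the 'onehot-dense' behaviour above can be reproduced with the stable API. A sketch with illustrative values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['S'], ['SO'], ['NO'], ['S']])

# handle_unknown='ignore' mirrors the CategoricalEncoder call above;
# .toarray() densifies the sparse output, like encoding='onehot-dense'
encoder = OneHotEncoder(handle_unknown='ignore')
dense = encoder.fit_transform(X).toarray()

print(list(encoder.categories_[0]))  # ['NO', 'S', 'SO'], sorted alphabetically
print(dense)
```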

This pull request, named "Rethinking the CategoricalEncoder API ?", also documents the process of deprecating CategoricalEncoder().

Following the above, I applied OrdinalEncoder, and the result I get is the same as when I applied LabelEncoder alone:

from __future__ import unicode_literals
# from .future_encoders import OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Read the dataset
direccion_viento = pd.read_csv('Direccion del viento.csv', )

# No null values
print(direccion_viento.isnull().any())
direccion_viento.isnull().values.any()

# We select only the first column Direccion Viento (pos)
direccion_viento = direccion_viento[['Direccion del viento (Pos)']]
print(direccion_viento.head(10))

ordinal_encoder = OrdinalEncoder()
direccion_viento_cat_encoded = ordinal_encoder.fit_transform(direccion_viento)

And I get this array, similar to the result of LabelEncoder():

[screenshot of the OrdinalEncoder output array]

What is the difference between OrdinalEncoder and LabelEncoder, given these descriptions:

LabelEncoder(): simply encodes the values into numbers according to how many categories there are. Label encoding is simply converting each value in a column to a number.

and

OrdinalEncoder: Encode categorical features as an integer array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature
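The practical difference is mainly in the expected input shape: LabelEncoder is meant for a single 1-D target array y, while OrdinalEncoder works on a 2-D feature matrix X and can encode several columns at once. A sketch with illustrative values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: 1-D input, intended for the target y
y = ['S', 'SO', 'NO']
y_enc = LabelEncoder().fit_transform(y)
print(y_enc)  # [1 2 0]

# OrdinalEncoder: 2-D input, intended for feature columns X
X = np.array([['S'], ['SO'], ['NO']])
X_enc = OrdinalEncoder().fit_transform(X)
print(X_enc.ravel())  # [1. 2. 0.]
```

Both assign the same integers (sorted alphabetically); only the input/output shapes and the intended role (y vs X) differ.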

Should I choose the dataset created by applying OneHotEncoder, or the one created by applying OrdinalEncoder? Which is more appropriate?

I think it is necessary to distinguish between nominal and ordinal features. Ordinal features are categorical values that can be sorted or ordered.

Sebastian Raschka, in his Python Machine Learning book, gives this example regarding categorical data:

For example, T-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of T-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.

My direccion_viento values ('E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO') do not have any order, and no value is greater or less than another, so it would not make sense to treat them as ordinal, would it?

In this sense, so far I think OneHotEncoding is the best option for my direccion_viento input features.

Somebody told me the following before:

Depends on what you plan to do with the data. There are various ways to work with categorical variables. You need to pick the most appropriate one for the model/situation you are working on, by investigating whether the approach you are taking is right for the model you are using.

I will work with models like clustering, linear regression, and neural networks.

How can I know whether OrdinalEncoder or OneHotEncoder is the most appropriate?
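For distance-based models such as clustering, and for linear models, this sketch illustrates the usual argument for one-hot over ordinal codes on nominal data: ordinal codes impose spurious distances between directions, while one-hot vectors keep every pair of distinct categories equally far apart.

```python
import numpy as np

cats = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']

# Ordinal codes: 0..7 by alphabetical position
codes = {c: float(i) for i, c in enumerate(cats)}

# One-hot vectors: one unit vector per category
eye = np.eye(len(cats))
onehot = {c: eye[i] for i, c in enumerate(cats)}

# With ordinal codes, 'E' (0) looks far closer to 'N' (1) than to 'SO' (7),
# even though wind directions have no such ordering
print(abs(codes['E'] - codes['N']))   # 1.0
print(abs(codes['E'] - codes['SO']))  # 7.0

# With one-hot, every pair of distinct directions is equally distant (sqrt(2))
print(np.linalg.norm(onehot['E'] - onehot['N']))   # ~1.414
print(np.linalg.norm(onehot['E'] - onehot['SO']))  # ~1.414
```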

Upvotes: 1

Views: 1424

Answers (2)

Anna Veronika Dorogush

Reputation: 1223

Try using CatBoost (https://catboost.ai, https://github.com/catboost/catboost) - a gradient boosting library that deals with categorical features.

Upvotes: 2

steveorsomething

Reputation: 81

In short: Yes, this is a common and accepted way of transforming your categorical variables.

As for whether this method would introduce more noise: The amount of information present is identical, so this alone wouldn't have any effect. If you're worried about the columns that now have only 0 values, that's a matter of your data and sampling quality. If you have no instances of (for example) Este, it will be ignored completely by the algorithm—in which case you may want to find some instances to include.
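If you want every direction to get a column even when some (like Este) never occur in the sample, one option, sketched here using pandas' Categorical dtype, is to declare the full category set up front:

```python
import pandas as pd

all_dirs = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']

# Sample in which 'E' (Este) never occurs
s = pd.Series(pd.Categorical(['S', 'SO', 'NO'], categories=all_dirs))

# get_dummies on a categorical dtype emits a column for every declared
# category, so absent ones appear as all-zero columns
dummies = pd.get_dummies(s)
print(list(dummies.columns))    # ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']
print(int(dummies['E'].sum()))  # 0
```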

You may want to also google 'imbalanced classes', which is what you're dealing with here.

Upvotes: 2
