Daniel Fourie

Reputation: 43

Why does OneHotEncoder only work for up to 5 different categorical variable values?

I have noticed that OneHotEncoder fails when a categorical column has 6 or more categories. For instance, I have a TestData.csv file with two columns: Geography and Continent. Geography's distinct values are France, Spain, Kenya, Botswana, and Nigeria, while Continent's distinct values are Europe and Africa. My goal is to encode the Geography column using OneHotEncoder. I run the following code to do this:

import numpy as np
import pandas as pd

#Importing the dataset
dataset = pd.read_csv('TestData.csv')
X = dataset.iloc[:,:].values #X is hence a 2-dimensional numpy.ndarray

#Encoding categorical column Geography
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') #the 0 is the column index for the categorical column we want to encode, in this case Geography
X = np.array(ct.fit_transform(X))

I then print(X) to make sure I get the expected output, which I do, and it looks like this:

[Screenshot: printed X; notice the size of X]

However, if I add one new country to the TestData file, say Belgium, we now have 6 distinct countries, and running the exact same code produces the following:

[Screenshot: result after adding Belgium]

It fails at the line

X = np.array(ct.fit_transform(X))

As you can see, X is not changed and no encoding is done. I have tested this multiple times, so it seems like OneHotEncoder can only handle up to 5 different category values. Is there a parameter I can change, or another method I can use, to encode categorical variables with more than 5 values?

PS - I know to remove the dummy variable after the encoding ;)

I am running Python 3.7.7

Thanks!

Upvotes: 1

Views: 946

Answers (2)

Michael Szczesny

Reputation: 5036

Since you are already using pandas, there is another way to do one-hot encoding. @the_martian's solution is a better answer to your question; mine is more of an extended comment.

Preparing example data similar to yours.

import numpy as np
import pandas as pd

a = np.random.choice(['afr','deu','swe','fi','rus','eng','wu'], 40)
b = np.random.choice(['eu','as'], 40)

df = pd.DataFrame({'a':a, 'b':b})
df.head()

Output

     a   b
0  rus  as
1  eng  as
2   fi  eu
3  swe  eu
4  eng  eu

You can use get_dummies for one-hot encoding

pd.get_dummies(df, columns=['a'])

Output (clipped)

     b  a_afr  a_deu  a_eng  a_fi  a_rus  a_swe  a_wu
0   eu      0      0      0     1      0      0     0
1   eu      0      0      0     0      1      0     0
2   as      0      0      0     0      1      0     0
3   eu      0      0      0     1      0      0     0
4   eu      0      0      0     0      0      0     1
5   as      0      0      0     0      0      1     0
...

Upvotes: 1

the_martian

Reputation: 291

I think the issue is with the sparse_threshold parameter of ColumnTransformer. Try setting it to 0 so that the output is always a dense numpy array. With 6 categories the density of your output falls below 0.3 (the default value), which prompts ColumnTransformer to switch to a sparse matrix. The output still contains the string column Continent, though, and sparse matrices can't contain strings.
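To see why the cutoff would land exactly between 5 and 6 categories, here is a rough back-of-the-envelope sketch in plain Python. It does not call scikit-learn; the helper stacked_density is a made-up name, and it assumes (as I understand the density rule) that each one-hot block contributes one non-zero value per row and that every value in the dense passthrough column counts as non-zero.

```python
def stacked_density(n_categories: int, n_passthrough: int = 1) -> float:
    """Per-row density after one-hot encoding a single column.

    Assumption (not taken from the question): the one-hot block has exactly
    one non-zero entry per row, and each dense passthrough value is counted
    as non-zero.
    """
    nonzero = 1 + n_passthrough          # one hot bit + passthrough values
    total = n_categories + n_passthrough  # all output columns
    return nonzero / total


DEFAULT_SPARSE_THRESHOLD = 0.3  # default sparse_threshold of ColumnTransformer

for k in (5, 6):
    density = stacked_density(k)
    goes_sparse = density < DEFAULT_SPARSE_THRESHOLD
    print(f"{k} categories: density={density:.3f}, sparse output={goes_sparse}")
```

Under those assumptions, 5 countries give 2/6 ≈ 0.333 (stays dense) and 6 countries give 2/7 ≈ 0.286 (switches to sparse), which matches what you are seeing. In your code the change would be to pass sparse_threshold=0 alongside remainder='passthrough' when constructing the ColumnTransformer.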

Upvotes: 3
