Reputation: 43
I have noticed OneHotEncoder fails when a categorical variable column has 6 or more categories. For instance, I have this TestData.csv file that has two columns: Geography and Continent. Geography's distinct values are France, Spain, Kenya, Botswana, and Nigeria, while Continent's distinct values are Europe and Africa. My goal is to encode the Geography column using OneHotEncoder. I run the following code to do this:
import numpy as np
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('TestData.csv')
X = dataset.iloc[:,:].values #X is hence a 2-dimensional numpy.ndarray
#Encoding categorical column Geography
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') #the 0 is the column index for the categorical column we want to encode, in this case Geography
X = np.array(ct.fit_transform(X))
I then print(X) to make sure I get the expected output, which I do, and it looks like this (also notice the size of X):
However, if I add one new country to the TestData file, say Belgium, we now have 6 distinct countries. Running the exact same code produces the following:
It fails at the line
X = np.array(ct.fit_transform(X))
As you can see, X is not changed and no encoding is done. I have tested this multiple times, so it seems like OneHotEncoder can only handle up to 5 distinct category values. Is there a parameter I can change, or another method I can use, to encode categorical variables with more than 5 values?
PS - I know to remove the dummy variable after the encoding ;)
I am running Python 3.7.7
Thanks!
Upvotes: 1
Views: 946
Reputation: 5036
I noticed you are already using pandas, so there is another way to do one-hot encoding. @the_martian's solution is a better answer to your question; my answer is more like an extended comment.
First, prepare example data similar to yours:
import numpy as np
import pandas as pd
a = np.random.choice(['afr','deu','swe','fi','rus','eng','wu'], 40)
b = np.random.choice(['eu','as'], 40)
df = pd.DataFrame({'a':a, 'b':b})
df.head()
Output
a b
0 rus as
1 eng as
2 fi eu
3 swe eu
4 eng eu
You can use get_dummies for one-hot encoding:
pd.get_dummies(df, columns=['a'])
Output(clipped)
b a_afr a_deu a_eng a_fi a_rus a_swe a_wu
0 eu 0 0 0 1 0 0 0
1 eu 0 0 0 0 1 0 0
2 as 0 0 0 0 1 0 0
3 eu 0 0 0 1 0 0 0
4 eu 0 0 0 0 0 0 1
5 as 0 0 0 0 0 1 0
...
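Since you mentioned removing a dummy variable after encoding: get_dummies can do that for you via its drop_first parameter. A sketch using the same toy frame as above (the seed is just for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible toy data
a = np.random.choice(['afr', 'deu', 'swe', 'fi', 'rus', 'eng', 'wu'], 40)
b = np.random.choice(['eu', 'as'], 40)
df = pd.DataFrame({'a': a, 'b': b})

# drop_first=True drops the first level of each encoded column,
# which avoids the dummy-variable trap in one step
encoded = pd.get_dummies(df, columns=['a'], drop_first=True)
print(encoded.columns.tolist())
```

This leaves you with one fewer dummy column per encoded variable, so there is no separate removal step afterwards.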
Upvotes: 1
Reputation: 291
I think the issue is with the sparse_threshold parameter of ColumnTransformer. Try setting it to 0 so the output is always a dense numpy array. With 6 or more categories, the density of your output falls below 0.3 (the default threshold), which prompts ColumnTransformer to switch to a sparse matrix; but the output still contains the string column Continent, and sparse matrices can't hold strings.
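A minimal sketch of the fix, with a small made-up DataFrame standing in for TestData.csv (six distinct countries, so the sparse switch would normally kick in):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Sample data standing in for TestData.csv
dataset = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Kenya', 'Botswana', 'Nigeria', 'Belgium'],
    'Continent': ['Europe', 'Europe', 'Africa', 'Africa', 'Africa', 'Europe'],
})
X = dataset.iloc[:, :].values

# sparse_threshold=0 forces dense output even when the one-hot
# columns are mostly zeros, so the passthrough string column is fine
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
X = np.array(ct.fit_transform(X))
print(X.shape)  # six one-hot columns plus the Continent column
```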
Upvotes: 3