roqds

Reputation: 201

Pandas dataframe encode Categorical variable with thousands of unique values

I have a dataframe with data on schools across a few thousand cities. The school is the row identifier and the city is encoded as follows:

school city          category   capacity
1      azez6576sebd  45         23
2      dsqozbc765aj  12         236
3      sqdqsd12887s  8          63 
4      azez6576sebd  7          234 
...

How can I convert the city variable to numeric, given that I have a few thousand cities? I guess one-hot encoding is not appropriate, as I would end up with too many columns. What is the general approach to converting a categorical variable with thousands of levels to numeric?

Thank you.

Upvotes: 16

Views: 40549

Answers (3)

aamir23

Reputation: 1233

An approach used in production ML systems and Kaggle competitions is to replace each category with target statistics. For a binary target variable, you can calculate the following for each distinct categorical value:

1) Number of positive labels
2) Number of negative labels
3) The ratio of the two

Here's a video explaining it - Large-Scale Learning - Dr. Mikhail Bilenko

Hash encoders are also suitable for your situation, where the 'city' column has a few thousand distinct values. This method scales well: you specify the number of binary output columns you want, and each category is hashed into one of them.
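As a minimal sketch of the hashing trick, using sklearn's FeatureHasher (the column names and the choice of 8 output columns are just illustrative):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({
    'school': [1, 2, 3, 4],
    'city': ['azez6576sebd', 'dsqozbc765aj', 'sqdqsd12887s', 'azez6576sebd'],
})

# Hash each city name into a fixed number of output columns (8 here).
# Distinct cities may collide, but with enough columns collisions are rare.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[c] for c in df['city']])  # scipy sparse, shape (4, 8)
```

Note that identical city strings always hash to the same column, so rows 1 and 4 (both `azez6576sebd`) get identical encodings.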

Another option for supervised learning is a Target Encoder or James-Stein encoder. These replace each category with the average value of the target over rows containing that category. But if your dataset isn't very large and you have only a few examples per category, this method may not be very useful.
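Plain target (mean) encoding can be sketched in pandas alone; the `target` column here is a hypothetical binary label, since the question's data has none:

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['a', 'a', 'b', 'b', 'b', 'c'],
    'target': [1,    0,   1,   1,   0,   1],  # hypothetical binary label
})

# Replace each city with the mean target over rows sharing that city.
means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(means)
```

To avoid target leakage in practice, compute the means on training folds only and apply them to the held-out data.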

Here's a helpful blogpost that I referred to - Encoding Categorical Variables

Upvotes: 3

BENY

Reputation: 323286

You can use the category dtype in pandas; cat.codes produces the same integer labels as sklearn's LabelEncoder:

df.city=df.city.astype('category').cat.codes
df
Out[385]: 
   school  city  category  capacity
0       1     0        45        23
1       2     1        12       236
2       3     2         8        63
3       4     0         7       234

Upvotes: 28

cs95

Reputation: 402553

A few thousand columns is still manageable in the context of ML classifiers, although you'd want to watch out for the curse of dimensionality.

That aside, you wouldn't want a get_dummies call to result in a memory blowout, so you could generate a SparseDataFrame instead -

v = pd.get_dummies(df.set_index('school').city, sparse=True)
v

        azez6576sebd  dsqozbc765aj  sqdqsd12887s
school                                          
1                  1             0             0
2                  0             1             0
3                  0             0             1
4                  1             0             0

type(v)
pandas.core.sparse.frame.SparseDataFrame

You can generate a sparse matrix using sdf.to_coo -

v.to_coo()

<4x3 sparse matrix of type '<class 'numpy.uint8'>'
    with 4 stored elements in COOrdinate format>
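Note that SparseDataFrame was removed in pandas 1.0. In current pandas, `get_dummies(..., sparse=True)` returns a regular DataFrame whose columns are backed by SparseArray, and the `.sparse` accessor converts it to a scipy COO matrix:

```python
import pandas as pd

df = pd.DataFrame({
    'school': [1, 2, 3, 4],
    'city': ['azez6576sebd', 'dsqozbc765aj', 'sqdqsd12887s', 'azez6576sebd'],
})

# Sparse-backed dummy columns instead of the removed SparseDataFrame.
v = pd.get_dummies(df.set_index('school')['city'], sparse=True)

# The .sparse accessor replaces the old SparseDataFrame.to_coo method.
coo = v.sparse.to_coo()  # scipy.sparse COO matrix, shape (4, 3)
```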

Upvotes: 5
