Reputation: 201
I have a dataframe with data on schools in a few thousand cities. The school is the row identifier and the city is encoded as follows:
school city category capacity
1 azez6576sebd 45 23
2 dsqozbc765aj 12 236
3 sqdqsd12887s 8 63
4 azez6576sebd 7 234
...
How can I convert the city variable to numeric, given that I have a few thousand cities? I guess one-hot encoding is not appropriate, as I would end up with too many columns. What is the general approach for converting a categorical variable with thousands of levels to numeric?
Thank you.
Upvotes: 16
Views: 40549
Reputation: 1233
A common approach, used in production ML systems and Kaggle competitions, is to embed each category via its target statistics. For a binary target variable, you can calculate the following for each distinct categorical value:
1) Number of positive labels
2) Number of negative labels
3) Ratio
Here's a video explaining it - Large-Scale Learning - Dr. Mikhail Bilenko
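The per-category statistics listed above can be sketched with a pandas groupby. This is only an illustration: the binary `target` column and the toy city names are made up (the question does not show a target), and "ratio" is taken here to mean positives divided by total rows, which is an assumption.

```python
import pandas as pd

# Toy data; the binary `target` column is hypothetical (not in the question).
df = pd.DataFrame({
    "city": ["a", "b", "c", "a", "a", "b"],
    "target": [1, 0, 1, 0, 1, 1],
})

# Per-city counts: positives are the sum of a 0/1 target, total is the row count.
stats = df.groupby("city")["target"].agg(n_pos="sum", n_total="count")
stats["n_neg"] = stats["n_total"] - stats["n_pos"]
# Assumption: "ratio" = share of positive labels within the city.
stats["pos_ratio"] = stats["n_pos"] / stats["n_total"]

# Attach the statistics to every row as numeric features.
df = df.join(stats[["n_pos", "n_neg", "pos_ratio"]], on="city")
```

To avoid leaking the label into the features, these statistics would normally be computed on the training rows only and then mapped onto validation and test rows.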
Hash encoders are also suitable for your situation, where the 'city' column has a few thousand distinct values. This method scales well. You need to specify the number of binary output columns you want.
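To make the idea concrete, here is a minimal hand-rolled sketch of the hashing trick rather than any particular library's encoder. The helper `hash_bucket` and the choice of 8 buckets are illustrative assumptions; in practice you would pick a bucket count large enough to keep collisions tolerable.

```python
import hashlib

import numpy as np
import pandas as pd

def hash_bucket(value: str, n_buckets: int) -> int:
    """Map a string to a stable bucket index in [0, n_buckets)."""
    # md5 is used only because it is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

df = pd.DataFrame(
    {"city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"]}
)

n_buckets = 8  # hypothetical; fixes the number of output columns up front
buckets = df["city"].map(lambda c: hash_bucket(c, n_buckets)).to_numpy()

# One binary column per bucket; each row has a single 1 in its city's bucket.
hashed = np.zeros((len(df), n_buckets), dtype=np.uint8)
hashed[np.arange(len(df)), buckets] = 1
```

The number of columns stays at `n_buckets` no matter how many distinct cities appear, at the cost that two different cities can land in the same bucket.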
Another option for supervised learning is a target encoder or a James-Stein encoder. Target encoding replaces each category with the average value of the target over the rows in that category. But if your dataset isn't very large and you have only a few examples per category, this method may not be very useful.
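A rough sketch of mean target encoding in pandas is shown below. To soften the few-examples-per-category problem, it blends each city's mean with the global mean using simple additive smoothing. This is a generic smoothing scheme, not the James-Stein estimator, and the `target` column, toy data, and smoothing strength `m` are all assumptions.

```python
import pandas as pd

# Toy data; `target` is a hypothetical label column.
df = pd.DataFrame({
    "city": ["a", "a", "a", "b", "c", "c"],
    "target": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

m = 2.0  # hypothetical smoothing strength: larger pulls rare cities to the global mean
global_mean = df["target"].mean()

# Per-city mean and row count of the target.
agg = df.groupby("city")["target"].agg(["mean", "count"])

# Weighted blend of the city mean and the global mean.
smoothed = (agg["count"] * agg["mean"] + m * global_mean) / (agg["count"] + m)

# Replace each city by its smoothed target mean.
df["city_encoded"] = df["city"].map(smoothed)
```

With `m = 0` this reduces to the plain per-category average. As with any target-based encoding, fitting it on the same rows you evaluate on leaks the label, so it is usually fitted on training folds only.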
Here's a helpful blogpost that I referred to - Encoding Categorical Variables
Upvotes: 3
Reputation: 323286
You can use the category dtype in pandas; the result should be the same as what sklearn's LabelEncoder gives:
df.city=df.city.astype('category').cat.codes
df
Out[385]:
school city category capacity
0 1 0 45 23
1 2 1 12 236
2 3 2 8 63
3 4 0 7 234
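If you need to decode the integers later or encode new rows consistently, one possible approach is to keep the categories around. The sketch below assumes the same toy city strings as in the question; the names `city_code`, `code_to_city`, and `unseen_city` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"]}
)

# Convert once and keep the categorical so the mapping is not lost.
city_cat = df["city"].astype("category")
df["city_code"] = city_cat.cat.codes

# Code -> original city string, for decoding later.
code_to_city = dict(enumerate(city_cat.cat.categories))

# Encode new data with the same categories; unseen cities get code -1.
new = pd.Series(["dsqozbc765aj", "unseen_city"])
new_codes = pd.Categorical(new, categories=city_cat.cat.categories).codes
```

Keep in mind that the codes follow the sorted order of the city strings, so their numeric order carries no meaning; models that treat the column as a continuous quantity may read structure into it that is not there.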
Upvotes: 28
Reputation: 402553
A few thousand columns is still manageable in the context of ML classifiers, although you'd want to watch out for the curse of dimensionality.
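That aside, you wouldn't want a get_dummies call to result in a memory blowout, so you could generate a SparseDataFrame instead (SparseDataFrame was removed in pandas 1.0; in later versions the same call returns a regular DataFrame with sparse columns):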
v = pd.get_dummies(df.set_index('school').city, sparse=True)
v
azez6576sebd dsqozbc765aj sqdqsd12887s
school
1 1 0 0
2 0 1 0
3 0 0 1
4 1 0 0
type(v)
pandas.core.sparse.frame.SparseDataFrame
You can generate a sparse matrix using to_coo (in pandas 1.0 and later, this is v.sparse.to_coo()):
v.to_coo()
<4x3 sparse matrix of type '<class 'numpy.uint8'>'
with 4 stored elements in COOrdinate format>
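For reference, a version of the same idea that should work on pandas 1.0 and later, where `SparseDataFrame` no longer exists, might look like this. It assumes scipy is installed; `dtype="uint8"` is passed explicitly only so the result matches the output shown above regardless of the pandas default.

```python
import pandas as pd

df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
})

# Returns an ordinary DataFrame whose columns use a sparse dtype.
v = pd.get_dummies(df.set_index("school")["city"], sparse=True, dtype="uint8")

# The .sparse accessor converts it to a scipy COO matrix (requires scipy).
m = v.sparse.to_coo()
```

The resulting matrix can be passed to most scikit-learn estimators directly, which avoids ever materialising the dense one-hot table.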
Upvotes: 5