How to Automatically Dummy Code High Cardinality Variables in Python

Question

I am working my way through the data engineer salary data set on Kaggle. The salary_currency column has the following value counts.

salary_currency
USD 13695
GBP   558
EUR   406
INR    51
CAD    49
...

16494 values total

Is there a way to dummy code only the values that make up at least 2% (or any chosen percentage) of a given column? In other words, only dummy code USD, GBP, and EUR?

Johnny Cheesecutter · Accepted Answer

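Yes. Recent versions of scikit-learn's OneHotEncoder (OHE) accept a min_frequency parameter: categories that occur in less than that fraction of rows are pooled into a single "infrequent" column.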

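import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# min_frequency=0.02 pools categories seen in under 2% of rows into one column.
# sparse_output=False returns a dense array (argument name used since scikit-learn 1.2).
oh = OneHotEncoder(min_frequency=0.02, sparse_output=False)
data = oh.fit_transform(df[['salary_currency']])
cols = oh.get_feature_names_out()
features = pd.DataFrame(data, columns=cols)
# Number of rows that fall into each dummy column
features.sum(axis=0)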

This returns the following counts per column:

salary_currency_CAD                    18.0
salary_currency_EUR                    95.0
salary_currency_GBP                    44.0
salary_currency_INR                    27.0
salary_currency_USD                   398.0
salary_currency_infrequent_sklearn     25.0
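
If you would rather stay in pandas, the same idea can be sketched by hand: compute each category's share of rows, collapse everything under the threshold into one label, then dummy code the result. This is only an illustration, not part of the answer above. The toy DataFrame uses just the five counts quoted in the question (the elided currencies are left out), and the names threshold and "other" are arbitrary choices.

```python
import pandas as pd

# Toy data: only the five counts quoted in the question. The elided
# currencies are omitted, so this frame has 14759 rows, not 16494.
counts = {"USD": 13695, "GBP": 558, "EUR": 406, "INR": 51, "CAD": 49}
df = pd.DataFrame(
    {"salary_currency": [cur for cur, n in counts.items() for _ in range(n)]}
)

threshold = 0.02  # keep categories that make up at least 2% of the rows

# Share of rows held by each category
share = df["salary_currency"].value_counts(normalize=True)
frequent = share[share >= threshold].index

# Replace every rare category with a single placeholder label
collapsed = df["salary_currency"].where(
    df["salary_currency"].isin(frequent), "other"
)

# Dummy code the collapsed column and count rows per dummy column
dummies = pd.get_dummies(collapsed, prefix="salary_currency", dtype=int)
column_totals = dummies.sum(axis=0)
```

On this toy data only USD, GBP, and EUR keep their own columns, and INR and CAD end up together in salary_currency_other. Unlike the fitted OneHotEncoder, this version does not remember the frequent categories for new data, so you would need to store frequent yourself and reuse it at prediction time.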