Amber Z.

Reputation: 379

How to set reference levels in a Spark ML Logistic Regression using OneHotEncoder

I'm working in PySpark on Spark 2.1 to prepare my data for a logistic regression. I have several string variables in my data, and I want to set the most frequent category of each as the reference level. I first use StringIndexer to encode a string column into label indices; I know these are ordered by label frequency, with the most frequent label receiving index 0.

from pyspark.ml.feature import StringIndexer

stringIndexer = StringIndexer(inputCol='income_grp', outputCol="income_grp_indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)

+-------------+------------------+
|   income_grp|income_grp_indexed|
+-------------+------------------+
|200000_299999|               0.0|
|300000_499999|               1.0|
|100000_199999|               2.0|
|500000_749999|               3.0|
|  less_100000|               4.0|
|750000_999999|               5.0|
|   ge_1000000|               6.0|
+-------------+------------------+
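As a sanity check, that ordering can be confirmed by counting each label against its index (column names as in the snippet above):

indexed.groupBy("income_grp", "income_grp_indexed") \
    .count() \
    .orderBy("income_grp_indexed") \
    .show()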

Then I use OneHotEncoder to map the column of label indices to a column of binary vectors. However, I only see an option in OneHotEncoder to drop the last level, which is the least frequent category.

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(dropLast=True, inputCol="income_grp_indexed", outputCol="income_grp_encoded")
encoded = encoder.transform(indexed)

+-------------+------------------+------------------+
|   income_grp|income_grp_indexed|income_grp_encoded|
+-------------+------------------+------------------+
|200000_299999|               0.0|     (6,[0],[1.0])|
|300000_499999|               1.0|     (6,[1],[1.0])|
|100000_199999|               2.0|     (6,[2],[1.0])|
|500000_749999|               3.0|     (6,[3],[1.0])|
|  less_100000|               4.0|     (6,[4],[1.0])|
|750000_999999|               5.0|     (6,[5],[1.0])|
|   ge_1000000|               6.0|         (6,[],[])|
+-------------+------------------+------------------+

How can I remove the most frequent category of each of my string variables?

Upvotes: 3

Views: 363

Answers (1)

Andre Barbosa

Reputation: 140

I know that it's an old question, and my answer may not work for Spark 2.1, but in Spark 3.1.2 (the version I'm using), StringIndexer has a stringOrderType argument, which can be set to 'frequencyAsc'. With that setting, the highest index is assigned to the most frequent label, and that is exactly the index OneHotEncoder drops when dropLast=True.

So you can do:

    stringIndexer = StringIndexer(inputCol='income_grp',
                                  outputCol="income_grp_indexed",
                                  stringOrderType='frequencyAsc')
    model = stringIndexer.fit(df)
    indexed = model.transform(df)

    # The rest is almost the same; note that in Spark 3, OneHotEncoder is an
    # estimator, so it needs its own fit step before transforming
    encoder = OneHotEncoder(dropLast=True, inputCol="income_grp_indexed", outputCol="income_grp_encoded")
    encoded = encoder.fit(indexed).transform(indexed)
    ...
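With 'frequencyAsc', the index order from the question is reversed, so the most frequent level (200000_299999 in the sample above) would receive the highest index, 6.0, and encode to the empty vector (6,[],[]), making it the reference level.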

This is especially relevant when using pyspark.ml.regression.GeneralizedLinearRegression, a model that can output the statistical p-values for each coefficient; those p-values can change depending on which category serves as the base (reference) level.
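For illustration, here is a minimal sketch of such a model; it assumes the feature vector has already been assembled (e.g. with VectorAssembler) into a column named features, that the label column is named label, and that train_df is a placeholder for your prepared DataFrame:

    from pyspark.ml.regression import GeneralizedLinearRegression

    # Binomial family with a logit link is a logistic regression whose
    # training summary exposes per-coefficient p-values
    glr = GeneralizedLinearRegression(family='binomial', link='logit',
                                      featuresCol='features', labelCol='label')
    glr_model = glr.fit(train_df)
    print(glr_model.summary.pValues)  # p-values for the coefficients and intercept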

Upvotes: 0
