Amber Z.

Reputation: 379

How to set reference levels in a Spark ML Logistic Regression using OneHotEncoder

I'm working in PySpark on Spark 2.1 to prepare my data for a logistic regression. I have several string variables in my data, and I want to set the most frequent category of each as the reference level. I first use StringIndexer to encode a string column into label indices; I know these are ordered by label frequency, with the most frequent label receiving index 0.

from pyspark.ml.feature import StringIndexer

stringIndexer = StringIndexer(inputCol='income_grp', outputCol="income_grp_indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)

+-------------+------------------+
|   income_grp|income_grp_indexed|
+-------------+------------------+
|200000_299999|               0.0|
|300000_499999|               1.0|
|100000_199999|               2.0|
|500000_749999|               3.0|
|  less_100000|               4.0|
|750000_999999|               5.0|
|   ge_1000000|               6.0|
+-------------+------------------+
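As a sanity check, that ordering can be confirmed by counting each label against its index (column names as in the snippet above):

indexed.groupBy("income_grp", "income_grp_indexed") \
    .count() \
    .orderBy("income_grp_indexed") \
    .show()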

Then I use OneHotEncoder to map the column of label indices to a column of binary vectors. However, I only see an option in OneHotEncoder to drop the last level, which is the least frequent category.

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(dropLast=True, inputCol="income_grp_indexed", outputCol="income_grp_encoded")
encoded = encoder.transform(indexed)

+-------------+------------------+------------------+
|   income_grp|income_grp_indexed|income_grp_encoded|
+-------------+------------------+------------------+
|200000_299999|               0.0|     (6,[0],[1.0])|
|300000_499999|               1.0|     (6,[1],[1.0])|
|100000_199999|               2.0|     (6,[2],[1.0])|
|500000_749999|               3.0|     (6,[3],[1.0])|
|  less_100000|               4.0|     (6,[4],[1.0])|
|750000_999999|               5.0|     (6,[5],[1.0])|
|   ge_1000000|               6.0|         (6,[],[])|
+-------------+------------------+------------------+

How can I remove the most frequent category of each of my string variables?

Upvotes: 3

Views: 363

Answers (1)

Andre Barbosa

Reputation: 140

I know that it's an old question, and my answer may not work for Spark 2.1, but in Spark 3.1.2 (the version I'm using), StringIndexer has a stringOrderType argument, which can be set to 'frequencyAsc'. With that setting, the highest index is assigned to the most frequent label, and that is exactly the index OneHotEncoder drops when dropLast=True.

So you can do:

    stringIndexer = StringIndexer(inputCol='income_grp',
                                  outputCol="income_grp_indexed",
                                  stringOrderType='frequencyAsc')
    model = stringIndexer.fit(df)
    indexed = model.transform(df)

    # The rest is almost the same; note that in Spark 3, OneHotEncoder is an
    # estimator, so it needs its own fit step before transforming
    encoder = OneHotEncoder(dropLast=True, inputCol="income_grp_indexed", outputCol="income_grp_encoded")
    encoded = encoder.fit(indexed).transform(indexed)
    ...
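With 'frequencyAsc', the index order from the question is reversed, so the most frequent level (200000_299999 in the sample above) would receive the highest index, 6.0, and encode to the empty vector (6,[],[]), making it the reference level.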

This is especially relevant when using pyspark.ml.regression.GeneralizedLinearRegression, a model that can output the statistical p-values for each coefficient; those p-values can change depending on which category serves as the base (reference) level.
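For illustration, here is a minimal sketch of such a model; it assumes the feature vector has already been assembled (e.g. with VectorAssembler) into a column named features, that the label column is named label, and that train_df is a placeholder for your prepared DataFrame:

    from pyspark.ml.regression import GeneralizedLinearRegression

    # Binomial family with a logit link is a logistic regression whose
    # training summary exposes per-coefficient p-values
    glr = GeneralizedLinearRegression(family='binomial', link='logit',
                                      featuresCol='features', labelCol='label')
    glr_model = glr.fit(train_df)
    print(glr_model.summary.pValues)  # p-values for the coefficients and intercept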

Upvotes: 0
