Maya

Reputation: 28

PySpark: StringIndexer with min_frequency like in scikit-learn's OrdinalEncoder

I am building a machine learning pipeline in PySpark with a StringIndexer as one of the stages. The problem is that some of the categories occur very rarely, so I would like all of those to be mapped to the same label. This is possible with scikit-learn's OrdinalEncoder via its min_frequency parameter.

I think I am looking for a way to extend the StringIndexer class, but I cannot for the life of me figure out how to do that. I figured I would have to override the fit (or _fit) method, but I cannot even find it in the PySpark source code (presumably because StringIndexer is a thin wrapper around the JVM implementation). Other suggestions are welcome.

Here is a small example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

data = [
    ("a",),
    ("b",),
    ("a",),
    ("b",),
    ("c",),
    ("b",),
    ("b",),
    ("d",),
    ("d",),
    ("d",),
    ("e",)
]

data = spark.createDataFrame(data, ["category"])

indexer = StringIndexer(inputCol='category', outputCol='label')
categories_and_labels = indexer.fit(data).transform(data)

The above code gives this result:

+--------+-----+
|category|label|
+--------+-----+
|       a|  2.0|
|       b|  0.0|
|       a|  2.0|
|       b|  0.0|
|       c|  3.0|
|       b|  0.0|
|       b|  0.0|
|       d|  1.0|
|       d|  1.0|
|       d|  1.0|
|       e|  4.0|
+--------+-----+

I would like a custom class with a parameter minFrequency that I can use like in the following:

indexer = CustomStringIndexer(inputCol='category', outputCol='label', minFrequency=3)
categories_and_labels = indexer.fit(data).transform(data)

The expected result would then be:

+--------+-----+
|category|label|
+--------+-----+
|       a|  2.0|
|       b|  0.0|
|       a|  2.0|
|       b|  0.0|
|       c|  2.0|
|       b|  0.0|
|       b|  0.0|
|       d|  1.0|
|       d|  1.0|
|       d|  1.0|
|       e|  2.0|
+--------+-----+

I am using Spark version 3.5.

Upvotes: 0

Views: 62

Answers (0)
