I am building a machine learning pipeline in PySpark with a StringIndexer as one of the stages. The problem is that some of the categories are very small, so I would like them all to be mapped to the same label. This is possible with the OrdinalEncoder from scikit-learn.
I think I am looking for a way to extend the StringIndexer class, but I cannot for the life of me figure out how to do that. I figured I would have to override the fit (or _fit) method, but I can't even find it in the source code. Other suggestions are welcome.
Here is a small example:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder.getOrCreate()
data = [
    ("a",),
    ("b",),
    ("a",),
    ("b",),
    ("c",),
    ("b",),
    ("b",),
    ("d",),
    ("d",),
    ("d",),
    ("e",)
]
data = spark.createDataFrame(data, ["category"])
indexer = StringIndexer(inputCol='category', outputCol='label')
categories_and_labels = indexer.fit(data).transform(data)
The code above gives this result:
+--------+-----+
|category|label|
+--------+-----+
|       a|  2.0|
|       b|  0.0|
|       a|  2.0|
|       b|  0.0|
|       c|  3.0|
|       b|  0.0|
|       b|  0.0|
|       d|  1.0|
|       d|  1.0|
|       d|  1.0|
|       e|  4.0|
+--------+-----+
I would like a custom class with a minFrequency parameter that I can use as follows:
indexer = CustomStringIndexer(inputCol='category', outputCol='label', minFrequency=3)
categories_and_labels = indexer.fit(data).transform(data)
The expected result would then be:
+--------+-----+
|category|label|
+--------+-----+
|       a|  2.0|
|       b|  0.0|
|       a|  2.0|
|       b|  0.0|
|       c|  2.0|
|       b|  0.0|
|       b|  0.0|
|       d|  1.0|
|       d|  1.0|
|       d|  1.0|
|       e|  2.0|
+--------+-----+
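To make the intended mapping concrete, here is a plain-Python sketch of the semantics I have in mind. It is not Spark code and not how StringIndexer is implemented. The function name frequency_labels is made up for illustration, and the alphabetical tie-break between equally frequent categories is an assumption chosen so that it reproduces the tables above.

```python
from collections import Counter


def frequency_labels(values, min_frequency=1):
    """Map each distinct value to a float label, most frequent value first.

    Values that occur fewer than `min_frequency` times all share one
    extra label placed after the frequent ones.
    (Hypothetical helper; it only illustrates the desired behaviour.)
    """
    counts = Counter(values)
    # Frequent categories: descending count, ties broken alphabetically
    # (the tie-break is an assumption that matches the example output).
    frequent = sorted(
        (v for v, n in counts.items() if n >= min_frequency),
        key=lambda v: (-counts[v], v),
    )
    mapping = {v: float(i) for i, v in enumerate(frequent)}
    # Every infrequent category collapses into the next free index.
    shared_label = float(len(frequent))
    for v in counts:
        mapping.setdefault(v, shared_label)
    return mapping


values = ["a", "b", "a", "b", "c", "b", "b", "d", "d", "d", "e"]
labels = frequency_labels(values, min_frequency=3)
```

With min_frequency=3, only b and d are frequent enough to keep their own labels; a, c and e share the label 2.0, as in the expected table. With the default min_frequency=1, every category keeps its own label, as in the first table.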
I am using Spark version 3.5.