Reputation: 25397
I'm looking for an equivalent transformer to sklearn's MultiLabelBinarizer. All I found so far is this Binarizer, which does not really do what I need. I was also looking at this documentation, but I can't see anything that does what I want.
My input is a column where each element is a list of labels:
labels
['a', 'b']
['a']
['c', 'b']
['a', 'c']
the output should be
labels
[1, 1, 0]
[1, 0, 0]
[0, 1, 1]
[1, 0, 1]
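In plain Python, the transformation I'm after would look like this (an illustrative sketch, not actual sklearn or PySpark code):

```python
rows = [['a', 'b'], ['a'], ['c', 'b'], ['a', 'c']]

# collect the distinct labels and fix their order, as MultiLabelBinarizer does
classes = sorted({label for row in rows for label in row})  # ['a', 'b', 'c']

# one 0/1 indicator per class, per row
binarized = [[1 if c in row else 0 for c in classes] for row in rows]
# binarized == [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1]]
```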
What's the PySpark equivalent to this?
Upvotes: 8
Views: 1326
Reputation: 9247
The following solution may not be extremely optimized, but I think it's quite simple and does its job quickly.
We basically create a function that collects all the distinct values contained in the labels column, then dynamically creates a 0/1 column for each value encountered.
import pyspark.sql.functions as F

def multi_label_binarizer(df, labels_col='labels', output_col='new_labels'):
    """
    Function that takes as input:
    - `df`, pyspark.sql.dataframe
    - `labels_col`, string that indicates an array column containing labels
    - `output_col`, string that indicates the name of the new labels column
    and returns a multi-label binarized column.
    """
    # get the set of unique labels and sort them
    labels_set = df\
        .withColumn('exploded', F.explode(labels_col))\
        .agg(F.collect_set('exploded'))\
        .collect()[0][0]
    labels_set = sorted(labels_set)

    # dynamically create a 0/1 column for each value in `labels_set`
    for i in labels_set:
        df = df.withColumn(i, F.when(F.array_contains(labels_col, i), 1).otherwise(0))

    # create the new, multi-label binarized array column
    df = df.withColumn(output_col, F.array(*labels_set))
    return df
multi_label_binarizer(df).show()
+------+---+---+---+----------+
|labels| a| b| c|new_labels|
+------+---+---+---+----------+
|[a, b]| 1| 1| 0| [1, 1, 0]|
| [a]| 1| 0| 0| [1, 0, 0]|
|[c, b]| 0| 1| 1| [0, 1, 1]|
|[a, c]| 1| 0| 1| [1, 0, 1]|
+------+---+---+---+----------+
Upvotes: 1