Reputation: 25397
I'm looking for an equivalent transformer to sklearn's MultiLabelBinarizer. All I found so far is this Binarizer, which does not really do what I need. I was also looking at this documentation, but I can't see anything that does what I want.
My input is a column where each element is a list of labels:
labels
['a', 'b']
['a']
['c', 'b']
['a', 'c']
the output should be
labels
[1, 1, 0]
[1, 0, 0]
[0, 1, 1]
[1, 0, 1]
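In plain Python, the transformation I'm after would look like this (an illustrative sketch, not actual sklearn or PySpark code):

```python
rows = [['a', 'b'], ['a'], ['c', 'b'], ['a', 'c']]

# collect the distinct labels and fix their order, as MultiLabelBinarizer does
classes = sorted({label for row in rows for label in row})  # ['a', 'b', 'c']

# one 0/1 indicator per class, per row
binarized = [[1 if c in row else 0 for c in classes] for row in rows]
# binarized == [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1]]
```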
What's the PySpark equivalent to this?
Upvotes: 8
Views: 1326
Reputation: 9247
The following solution may not be extremely optimized, but I think it's quite simple and does its job quickly.
We basically create a function that collects all the distinct values contained in the labels column, then dynamically creates a 0/1 column for each value encountered.
import pyspark.sql.functions as F

def multi_label_binarizer(df, labels_col='labels', output_col='new_labels'):
    """
    Function that takes as input:
    - `df`, pyspark.sql.dataframe
    - `labels_col`, string that indicates an array column containing labels
    - `output_col`, string that indicates the name of the new labels column
    and returns a multi-label binarized column.
    """
    # get the set of unique labels and sort them
    labels_set = df\
        .withColumn('exploded', F.explode(labels_col))\
        .agg(F.collect_set('exploded'))\
        .collect()[0][0]
    labels_set = sorted(labels_set)

    # dynamically create a 0/1 column for each value in `labels_set`
    for i in labels_set:
        df = df.withColumn(i, F.when(F.array_contains(labels_col, i), 1).otherwise(0))

    # create the new, multi-label binarized array column
    df = df.withColumn(output_col, F.array(*labels_set))
    return df
multi_label_binarizer(df).show()
+------+---+---+---+----------+
|labels| a| b| c|new_labels|
+------+---+---+---+----------+
|[a, b]| 1| 1| 0| [1, 1, 0]|
| [a]| 1| 0| 0| [1, 0, 0]|
|[c, b]| 0| 1| 1| [0, 1, 1]|
|[a, c]| 1| 0| 1| [1, 0, 1]|
+------+---+---+---+----------+
Upvotes: 1