PySpark: how to use `StringIndexer` to do label encoding with the string array column

Question

As we know, we can do LabelEncoder() by StringIndexer in the string column, but if want to do LabelEncoder() on string array column, it is not easy to implement.

# input
df.show()

+--------------------------------------+
|                                  tags|
+--------------------------------------+
|        [industry, display, Merchants]|
|    [smart, swallow, game, Experience]|
|             [social, picture, social]|
|        [default, game, us, adventure]|
| [financial management, loan, product]|
|       [system, profile, optimization]|

...
# After do LabelEncoder() on `tags` column 
...

+--------------------------------------+
|                                  tags|
+--------------------------------------+
|                             [0, 1, 2]|
|                          [3, 4, 4, 5]|
|                             [6, 7, 6]|
|                         [8, 4, 9, 10]|
|                          [11, 12, 13]|
|                          [14, 15, 16]|

chlebek · Accepted Answer

Python version will be very similar:

// add unique id to each row
val df2 = df.withColumn("id", monotonically_increasing_id).select('id, explode('tags).as("tag"))

val indexer = new StringIndexer()
  .setInputCol("tag")
  .setOutputCol("tagIndex")

val indexed = indexer.fit(df2).transform(df2)

// in the final step you should convert tags back to array of tags
val dfFinal = indexed.groupBy('id).agg(collect_list('tagIndex))

PySpark: how to use `StringIndexer` to do label encoding with the string array column

Answers (2)

Related Questions