Francesco De Santis

Reputation: 183

Encode column of lists into integer in pyspark

I have a pyspark dataframe like this one:

 --------------------
| id | configuration |
|----|---------------|
| 1  | [c1, c2, a1]  |
| 2  | [c1, c2, a1]  |
| 3  | [z1, x6, a8]  |
 --------------------

I want to encode the configuration column as a column of integers, so that rows with identical lists get the same label. This is the desired dataframe:

 -----------------------------
| id | configuration | labels |
|----|---------------|--------|
| 1  | [c1, c2, a1]  |    1   |
| 2  | [c1, c2, a1]  |    1   |
| 3  | [z1, x6, a8]  |    2   |
 -----------------------------

How can I perform this operation?

Upvotes: 0

Views: 57

Answers (1)

wwnde

Reputation: 26676

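Use a window function with dense_rank(). Ordering the window by configuration gives identical lists the same rank, and the ranks are consecutive integers starting at 1: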

from pyspark.sql import Window, functions as F
df.withColumn('labels', F.dense_rank().over(Window.partitionBy().orderBy('configuration'))).show()

+---+-------------+------+
| id|configuration|labels|
+---+-------------+------+
|  1| [c1, c2, a1]|     1|
|  2| [c1, c2, a1]|     1|
|  3| [z1, x6, a8]|     2|
+---+-------------+------+
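To see why this produces the labels above, here is a minimal plain-Python sketch of what a dense rank does: sort the distinct values, number them consecutively, and map each row back to its number. The helper name dense_rank_labels is hypothetical and only for illustration; it is not part of pyspark, and it assumes Python's element-by-element tuple comparison is a fair stand-in for how the lists are ordered.

```python
def dense_rank_labels(rows):
    """Assign 1-based dense ranks so that equal lists share the same label."""
    # Lists are not hashable, so convert each one to a tuple before
    # collecting the distinct values.
    distinct = sorted({tuple(r) for r in rows})
    # Consecutive integers with no gaps: this is what makes the rank "dense".
    rank_of = {value: i for i, value in enumerate(distinct, start=1)}
    return [rank_of[tuple(r)] for r in rows]


# Same data as the example dataframe (hypothetical in-memory copy).
configurations = [["c1", "c2", "a1"], ["c1", "c2", "a1"], ["z1", "x6", "a8"]]
labels = dense_rank_labels(configurations)
print(labels)  # [1, 1, 2]
```

Note that a window with no partitioning makes Spark move all rows to a single partition to compute the rank, so this approach may be slow on large dataframes.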

Upvotes: 1
