Reputation: 183
I have a PySpark DataFrame like this one:
--------------------
| id | configuration |
|----|---------------|
| 1 | [c1, c2, a1] |
| 2 | [c1, c2, a1] |
| 3 | [z1, x6, a8] |
--------------------
I want to encode the configuration column into a column of integers, so that rows with the same configuration get the same label. The following is the desired DataFrame:
-----------------------------
| id | configuration | labels |
|----|---------------|--------|
| 1 | [c1, c2, a1] | 1 |
| 2 | [c1, c2, a1] | 1 |
| 3 | [z1, x6, a8] | 2 |
-----------------------------
How can I perform this operation?
Upvotes: 0
Views: 57
Reputation: 26676
Use the window function dense_rank(): order the window by configuration, and rows with identical arrays receive the same rank. Note that an empty partitionBy() puts all rows into a single partition, which is fine for small data but does not scale:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

df.withColumn('labels', dense_rank().over(Window.partitionBy().orderBy('configuration'))).show()
+---+-------------+------+
| id|configuration|labels|
+---+-------------+------+
| 1| [c1, c2, a1]| 1|
| 2| [c1, c2, a1]| 1|
| 3| [z1, x6, a8]| 2|
+---+-------------+------+
Upvotes: 1