Reputation: 846
So I have a question regarding pyspark. I have a dataframe that looks like this:
+---+------------+
| id| list|
+---+------------+
| 2|[3, 5, 4, 2]|
+---+------------+
| 3|[4, 5, 3, 2]|
+---+------------+
And I would like to explode lists
it into multiple rows and keeping information about which position did each element of the list had in a separate column. The result should look like this:
+---+------------+------------+
| id| listitem| rank|
+---+------------+------------+
| 2| 3| 1|
+---+------------+------------+
| 2| 5| 2|
+---+------------+------------+
| 2| 4| 3|
+---+------------+------------+
| 2| 2| 4|
+---+------------+------------+
| 3| 4| 1|
+---+------------+------------+
| 3| 5| 2|
+---+------------+------------+
| 3| 3| 3|
+---+------------+------------+
| 3| 2| 4|
+---+------------+------------+
The rank column has the index+1 of the position each element had in the list. Any suggestions on the most optimal code to achieve it?
Upvotes: 3
Views: 2823
Reputation: 5487
You can use posexplode() or posexplode_outer() function to get desired result.
df = spark.createDataFrame([(2, [3, 5, 4, 2]), (3, [4, 5, 3, 2])], ["id", "list"])
df.select('id',posexplode_outer('list').alias('rank', 'listitem')) \
.withColumn('rank', col('rank') + 1).show()
+---+----+--------+
| id|rank|listitem|
+---+----+--------+
| 2| 1| 3|
| 2| 2| 5|
| 2| 3| 4|
| 2| 4| 2|
| 3| 1| 4|
| 3| 2| 5|
| 3| 3| 3|
| 3| 4| 2|
+---+----+--------+
Upvotes: 5