Exploding an array into 2 columns

Question

Suppose we want to track the hops made by an package from warehouse to the customer. We have a table which store the data but the data is in a column SAY Route The package starts at the Warehouse – YYY,TTT,MMM The hops end when the package is delivered to the CUSTOMER The values in the Route column are separated by space

ID  Route   
1   TTT A B X Y Z CUSTOMER
2   YYY E Y F G I P B X Q CUSTOMER
3   MMM R T K L CUSTOMER

Expected Output

ID START    END
1   TTT     A
1   A       B
1   B       X
.
.
.
1   Z       CUSTOMER
2   YYY     E
2   E       Y
2   Y       F
.
.
2   Q       CUSTOMER
3   MMM     R
.
.
3   L       CUSTOMER

Is there anyway to achieve this in pyspark

mck · Accepted Answer

Add an index to the split route using posexplode, and get the location at the next index for each starting location using lead. If you want to remove the index simply add .drop('index') at the end.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

df2 = df.select(
    'ID',
    F.posexplode(F.split('Route', ' ')).alias('index', 'start')
).withColumn(
    'end', 
    F.lead('start').over(Window.partitionBy('ID').orderBy('index'))
).orderBy('ID', 'index').dropna()

df2.show(99,0)
+---+-----+-----+--------+
|ID |index|start|end     |
+---+-----+-----+--------+
|1  |0    |TTT  |A       |
|1  |1    |A    |B       |
|1  |2    |B    |X       |
|1  |3    |X    |Y       |
|1  |4    |Y    |Z       |
|1  |5    |Z    |CUSTOMER|
|2  |0    |YYY  |E       |
|2  |1    |E    |Y       |
|2  |2    |Y    |F       |
|2  |3    |F    |G       |
|2  |4    |G    |I       |
|2  |5    |I    |P       |
|2  |6    |P    |B       |
|2  |7    |B    |X       |
|2  |8    |X    |Q       |
|2  |9    |Q    |CUSTOMER|
|3  |0    |MMM  |R       |
|3  |1    |R    |T       |
|3  |2    |T    |K       |
|3  |3    |K    |L       |
|3  |4    |L    |CUSTOMER|
+---+-----+-----+--------+

Exploding an array into 2 columns

Answers (1)

Related Questions