Reputation: 291
Suppose we want to track the hops made by an package from warehouse to the customer. We have a table which store the data but the data is in a column SAY Route The package starts at the Warehouse – YYY,TTT,MMM The hops end when the package is delivered to the CUSTOMER The values in the Route column are separated by space
ID Route
1 TTT A B X Y Z CUSTOMER
2 YYY E Y F G I P B X Q CUSTOMER
3 MMM R T K L CUSTOMER
Expected Output
ID START END
1 TTT A
1 A B
1 B X
.
.
.
1 Z CUSTOMER
2 YYY E
2 E Y
2 Y F
.
.
2 Q CUSTOMER
3 MMM R
.
.
3 L CUSTOMER
Is there anyway to achieve this in pyspark
Upvotes: 1
Views: 98
Reputation: 42342
Add an index to the split route using posexplode
, and get the location at the next index for each starting location using lead
. If you want to remove the index simply add .drop('index')
at the end.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df2 = df.select(
'ID',
F.posexplode(F.split('Route', ' ')).alias('index', 'start')
).withColumn(
'end',
F.lead('start').over(Window.partitionBy('ID').orderBy('index'))
).orderBy('ID', 'index').dropna()
df2.show(99,0)
+---+-----+-----+--------+
|ID |index|start|end |
+---+-----+-----+--------+
|1 |0 |TTT |A |
|1 |1 |A |B |
|1 |2 |B |X |
|1 |3 |X |Y |
|1 |4 |Y |Z |
|1 |5 |Z |CUSTOMER|
|2 |0 |YYY |E |
|2 |1 |E |Y |
|2 |2 |Y |F |
|2 |3 |F |G |
|2 |4 |G |I |
|2 |5 |I |P |
|2 |6 |P |B |
|2 |7 |B |X |
|2 |8 |X |Q |
|2 |9 |Q |CUSTOMER|
|3 |0 |MMM |R |
|3 |1 |R |T |
|3 |2 |T |K |
|3 |3 |K |L |
|3 |4 |L |CUSTOMER|
+---+-----+-----+--------+
Upvotes: 1