Snehasish Das
Snehasish Das

Reputation: 291

Exploding an array into 2 columns

Suppose we want to track the hops made by an package from warehouse to the customer. We have a table which store the data but the data is in a column SAY Route The package starts at the Warehouse – YYY,TTT,MMM The hops end when the package is delivered to the CUSTOMER The values in the Route column are separated by space

ID  Route   
1   TTT A B X Y Z CUSTOMER
2   YYY E Y F G I P B X Q CUSTOMER
3   MMM R T K L CUSTOMER

Expected Output

ID START    END
1   TTT     A
1   A       B
1   B       X
.
.
.
1   Z       CUSTOMER
2   YYY     E
2   E       Y
2   Y       F
.
.
2   Q       CUSTOMER
3   MMM     R
.
.
3   L       CUSTOMER

Is there anyway to achieve this in pyspark

Upvotes: 1

Views: 98

Answers (1)

mck
mck

Reputation: 42342

Add an index to the split route using posexplode, and get the location at the next index for each starting location using lead. If you want to remove the index simply add .drop('index') at the end.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

df2 = df.select(
    'ID',
    F.posexplode(F.split('Route', ' ')).alias('index', 'start')
).withColumn(
    'end', 
    F.lead('start').over(Window.partitionBy('ID').orderBy('index'))
).orderBy('ID', 'index').dropna()

df2.show(99,0)
+---+-----+-----+--------+
|ID |index|start|end     |
+---+-----+-----+--------+
|1  |0    |TTT  |A       |
|1  |1    |A    |B       |
|1  |2    |B    |X       |
|1  |3    |X    |Y       |
|1  |4    |Y    |Z       |
|1  |5    |Z    |CUSTOMER|
|2  |0    |YYY  |E       |
|2  |1    |E    |Y       |
|2  |2    |Y    |F       |
|2  |3    |F    |G       |
|2  |4    |G    |I       |
|2  |5    |I    |P       |
|2  |6    |P    |B       |
|2  |7    |B    |X       |
|2  |8    |X    |Q       |
|2  |9    |Q    |CUSTOMER|
|3  |0    |MMM  |R       |
|3  |1    |R    |T       |
|3  |2    |T    |K       |
|3  |3    |K    |L       |
|3  |4    |L    |CUSTOMER|
+---+-----+-----+--------+

Upvotes: 1

Related Questions