Gus

Reputation: 21

Spark dataframe - transform rows with same ID to columns

I want to transform the source dataframe below (using pyspark):

Key  ID  segment
1    A   m1
2    A   m1
3    B   m1
4    C   m2
1    D   m1
2    E   m1
3    F   m1
4    G   m2
1    J   m1
2    J   m1
3    J   m1
4    K   m2

Into the result dataframe below:

ID  key1  key2
A   1     2
B   3     -
C   4     -
D   1     -
F   3     -
G   4     -
J   1     2
J   1     3
J   2     3
K   4     -

In other words: I want to highlight the "pairs" in the dataframe. If I have more than one key for the same ID, I would like to show each relation on a separate line.
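For reference, the source dataframe above can be built with something like this (a minimal sketch; column names are taken from the table, and the SparkSession setup is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'A', 'm1'), (2, 'A', 'm1'), (3, 'B', 'm1'), (4, 'C', 'm2'),
     (1, 'D', 'm1'), (2, 'E', 'm1'), (3, 'F', 'm1'), (4, 'G', 'm2'),
     (1, 'J', 'm1'), (2, 'J', 'm1'), (3, 'J', 'm1'), (4, 'K', 'm2')],
    ['Key', 'ID', 'segment'],
)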

Thank you for your help

Upvotes: 1

Views: 441

Answers (1)

wwnde

Reputation: 26676

Use window functions. I assume - means a one-member group; if not, you can use a when/otherwise condition to blank the 1s out (see the snippet after the output below).

import sys

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy(F.desc('Key'))

df = (df.withColumn('key2', F.lag('segment').over(w))   # new column with the preceding row's segment
      .withColumn('key2', F.col('key2').isNotNull())    # boolean: does the row have a preceding key?
      .withColumn('key2', F.sum(F.col('key2').cast('integer'))
                           .over(w.rowsBetween(Window.currentRow, sys.maxsize)) + 1)  # cumulative groups
      .orderBy('ID', 'Key')                              # reorder the frame
     )

df.show()

+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
|  1|  A|     m1|   2|
|  2|  A|     m1|   2|
|  3|  B|     m1|   1|
|  4|  C|     m2|   1|
|  1|  D|     m1|   1|
|  2|  E|     m1|   1|
|  3|  F|     m1|   1|
|  4|  G|     m2|   1|
|  1|  J|     m1|   2|
|  2|  J|     m1|   3|
|  3|  J|     m1|   3|
|  4|  K|     m2|   1|
+---+---+-------+----+
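If the 1s should show as - instead, a when/otherwise condition along these lines can blank them out (a sketch built on the result above):

df = df.withColumn(
    'key2',
    F.when(F.col('key2') == 1, F.lit('-'))       # one-member group -> '-'
     .otherwise(F.col('key2').cast('string'))    # otherwise keep the group count, as a string
)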

Upvotes: 1
