Reputation: 21
I want to transform the source dataframe below (using PySpark):
Key | ID | segment |
---|---|---|
1 | A | m1 |
2 | A | m1 |
3 | B | m1 |
4 | C | m2 |
1 | D | m1 |
2 | E | m1 |
3 | F | m1 |
4 | G | m2 |
1 | J | m1 |
2 | J | m1 |
3 | J | m1 |
4 | K | m2 |
Into the result dataframe below:
ID | key1 | key2 |
---|---|---|
A | 1 | 2 |
B | 3 | - |
C | 4 | - |
D | 1 | - |
E | 2 | - |
F | 3 | - |
G | 4 | - |
J | 1 | 2 |
J | 1 | 3 |
J | 2 | 3 |
K | 4 | - |
In other words, I want to highlight the "pairs" in the dataframe: if there is more than one key for the same ID, I would like to show each relation on a separate line.
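For reference, here is a small snippet to rebuild the source dataframe (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'A', 'm1'), (2, 'A', 'm1'), (3, 'B', 'm1'), (4, 'C', 'm2'),
        (1, 'D', 'm1'), (2, 'E', 'm1'), (3, 'F', 'm1'), (4, 'G', 'm2'),
        (1, 'J', 'm1'), (2, 'J', 'm1'), (3, 'J', 'm1'), (4, 'K', 'm2')]
df = spark.createDataFrame(data, ['Key', 'ID', 'segment'])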
Thank you for your help
Upvotes: 1
Views: 441
Reputation: 26676
Use window functions. I assume - means a one-member group. If not, you can use a when/otherwise condition to blank the 1s out (see the sketch after the output below).
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('ID').orderBy(F.desc('Key'))
df = (df.withColumn('key2', F.lag('segment').over(w))  # preceding segment for each row; null at the start of each partition
        .withColumn('key2', F.col('key2').isNotNull())  # boolean: does a preceding row exist in the group?
        .withColumn('key2', F.sum(F.col('key2').cast('integer'))
                             .over(w.rowsBetween(Window.currentRow, sys.maxsize)) + 1)  # create cumulative groups
        .orderBy('ID', 'Key'))  # reorder the frame
df.show()
+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
| 1| A| m1| 2|
| 2| A| m1| 2|
| 3| B| m1| 1|
| 4| C| m2| 1|
| 1| D| m1| 1|
| 2| E| m1| 1|
| 3| F| m1| 1|
| 4| G| m2| 1|
| 1| J| m1| 2|
| 2| J| m1| 3|
| 3| J| m1| 3|
| 4| K| m2| 1|
+---+---+-------+----+
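To blank the 1s out, i.e. show null instead of a group number for one-member groups, one more when/otherwise pass over the result could look like this (a sketch on top of the key2 column computed above):

df = df.withColumn('key2', F.when(F.col('key2') > 1, F.col('key2')))  # rows with key2 == 1 become null

And if you need the literal key1/key2 pair rows from the expected output rather than group numbers, a self-join on ID (a different technique from the window approach; column names key1/key2 taken from the question) is one way to sketch it:

a, b = df.alias('a'), df.alias('b')
pairs = (a.join(b, (F.col('a.ID') == F.col('b.ID')) & (F.col('a.Key') < F.col('b.Key')))
          .select(F.col('a.ID').alias('ID'),
                  F.col('a.Key').alias('key1'),
                  F.col('b.Key').alias('key2')))  # every ordered key pair per ID
singles = (df.groupBy('ID')
             .agg(F.count('*').alias('n'), F.min('Key').alias('key1'))
             .where(F.col('n') == 1)
             .drop('n')
             .withColumn('key2', F.lit(None)))  # single-key IDs get a null key2 (the "-" in the question)
result = pairs.unionByName(singles).orderBy('ID', 'key1', 'key2')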
Upvotes: 1