Reputation: 13
I have a dataset similar to this one
| exp | pid | mat | pskey  | order |
|-----|-----|-----|--------|-------|
| 1   | CR  | P   | 1-CR-P | 1     |
| 1   | M   | C   | 1-M-C  | 2     |
| 1   | CR  | C   | 1-CR-C | 3     |
| 1   | PP  | C   | 1-PP-C | 4     |
| 2   | CR  | P   | 2-CR-P | 1     |
| 2   | CR  | P   | 2-CR-P | 1     |
| 2   | M   | C   | 2-M-C  | 2     |
| 2   | CR  | C   | 2-CR-C | 3     |
| 2   | CR  | C   | 2-CR-C | 3     |
| 2   | CR  | C   | 2-CR-C | 3     |
| 2   | CR  | C   | 2-CR-C | 3     |
| 2   | CR  | C   | 2-CR-C | 3     |
| 2   | PP  | C   | 2-PP-C | 4     |
| 2   | PP  | C   | 2-PP-C | 4     |
| 2   | PP  | C   | 2-PP-C | 4     |
| 2   | PP  | C   | 2-PP-C | 4     |
| 2   | PP  | C   | 2-PP-C | 4     |
| 3   | M   | C   | 3-M-C  | 2     |
| 4   | CR  | P   | 4-CR-P | 1     |
| 4   | M   | C   | 4-M-C  | 2     |
| 4   | CR  | C   | 4-CR-C | 3     |
| 4   | PP  | C   | 4-PP-C | 4     |
What I need is to get the pskey of the predecessor(s) within the same exp, given the following relation:
order 1 -> no predecessor
order 2 -> no predecessor
order 3 -> [1,2]
order 4 -> [3]
And add those values to a new column called predecessor.
The expected result would look like this:
+---+---+---+------+-----+---------------+
|exp|pid|mat|pskey |order|predecessor    |
+---+---+---+------+-----+---------------+
|1  |CR |P  |1-CR-P|1    |null           |
|1  |M  |C  |1-M-C |2    |null           |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C]|
|1  |PP |C  |1-PP-C|4    |[1-CR-C]       |
|3  |M  |C  |3-M-C |2    |null           |
|2  |CR |P  |2-CR-P|1    |null           |
|2  |CR |P  |2-CR-P|1    |null           |
|2  |M  |C  |2-M-C |2    |null           |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|4  |CR |P  |4-CR-P|1    |null           |
|4  |M  |C  |4-M-C |2    |null           |
|4  |CR |C  |4-CR-C|3    |[4-CR-P, 4-M-C]|
|4  |PP |C  |4-PP-C|4    |[4-CR-C]       |
+---+---+---+------+-----+---------------+
I am quite new to PySpark, so I have no clue how to approach this.
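For reference, the sample data above can be reproduced with something like this (a minimal sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the table above: (exp, pid, mat, pskey, order)
data = [
    (1, "CR", "P", "1-CR-P", 1), (1, "M", "C", "1-M-C", 2),
    (1, "CR", "C", "1-CR-C", 3), (1, "PP", "C", "1-PP-C", 4),
    (2, "CR", "P", "2-CR-P", 1), (2, "CR", "P", "2-CR-P", 1),
    (2, "M", "C", "2-M-C", 2),
    (2, "CR", "C", "2-CR-C", 3), (2, "CR", "C", "2-CR-C", 3),
    (2, "CR", "C", "2-CR-C", 3), (2, "CR", "C", "2-CR-C", 3),
    (2, "CR", "C", "2-CR-C", 3),
    (2, "PP", "C", "2-PP-C", 4), (2, "PP", "C", "2-PP-C", 4),
    (2, "PP", "C", "2-PP-C", 4), (2, "PP", "C", "2-PP-C", 4),
    (2, "PP", "C", "2-PP-C", 4),
    (3, "M", "C", "3-M-C", 2),
    (4, "CR", "P", "4-CR-P", 1), (4, "M", "C", "4-M-C", 2),
    (4, "CR", "C", "4-CR-C", 3), (4, "PP", "C", "4-PP-C", 4),
]
df = spark.createDataFrame(data, ["exp", "pid", "mat", "pskey", "order"])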
Upvotes: 1
Views: 110
Reputation: 15258
Different cases on order are handled with when. You aggregate the values with collect_set to get unique identifiers:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    "predecessor",
    # order 3: predecessors are orders 1 and 2 -> window range [-2, -1]
    F.when(
        F.col("order") == 3,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-2, -1)
        ),
    # order 4: predecessor is order 3 -> window range [-1, -1]
    ).when(
        F.col("order") == 4,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-1, -1)
        ),
    ),
    # orders 1 and 2 match no branch and fall through to null
)
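Since the window is ordered by order, the rangeBetween bounds are offsets on the values of order itself: for a row with order 3, rangeBetween(-2, -1) covers all rows of the same exp whose order is 1 or 2, and for order 4, rangeBetween(-1, -1) covers the rows with order 3. collect_set also deduplicates, which is why the repeated pskey values in exp 2 appear only once in the arrays.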
The result:
df2.show(truncate=False)
+---+---+---+------+-----+---------------+
|exp|pid|mat|pskey |order|predecessor    |
+---+---+---+------+-----+---------------+
|1  |CR |P  |1-CR-P|1    |null           |
|1  |M  |C  |1-M-C |2    |null           |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C]|
|1  |PP |C  |1-PP-C|4    |[1-CR-C]       |
|3  |M  |C  |3-M-C |2    |null           |
|2  |CR |P  |2-CR-P|1    |null           |
|2  |CR |P  |2-CR-P|1    |null           |
|2  |M  |C  |2-M-C |2    |null           |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]|
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]       |
|4  |CR |P  |4-CR-P|1    |null           |
|4  |M  |C  |4-M-C |2    |null           |
+---+---+---+------+-----+---------------+
only showing top 20 rows
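(show() prints only 20 rows by default, so the last two exp 4 rows are cut off here; they get [4-CR-P, 4-M-C] and [4-CR-C], matching the expected output.)

If more order levels appear later, the same pattern can be built from a mapping instead of hard-coding each branch. A minimal sketch, assuming the predecessor relation stays expressible as contiguous ranges of order values (the PREDECESSOR_RANGES name is hypothetical, not part of the original answer):

from pyspark.sql import functions as F, Window

# Hypothetical mapping taken from the relation in the question:
# order 3 <- orders [1, 2]  => rangeBetween(-2, -1)
# order 4 <- order  [3]     => rangeBetween(-1, -1)
PREDECESSOR_RANGES = {3: (-2, -1), 4: (-1, -1)}

expr = None
for order_value, (start, end) in PREDECESSOR_RANGES.items():
    w = Window.partitionBy("exp").orderBy("order").rangeBetween(start, end)
    cond = F.col("order") == order_value
    values = F.collect_set("pskey").over(w)
    # F.when starts the chain; Column.when extends it.
    expr = F.when(cond, values) if expr is None else expr.when(cond, values)

df2 = df.withColumn("predecessor", expr)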
Upvotes: 2