Daniel
Daniel

Reputation: 13

Pyspark get predecessor value

I have a dataset similar to this one

exp pid mat pskey order
1 CR P 1-CR-P 1
1 M C 1-M-C 2
1 CR C 1-CR-C 3
1 PP C 1-PP-C 4
2 CR P 2-CR-P 1
2 CR P 2-CR-P 1
2 M C 2-M-C 2
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 CR C 2-CR-C 3
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
2 PP C 2-PP-C 4
3 M C 3-M-C 2
4 CR P 4-CR-P 1
4 M C 4-M-C 2
4 CR C 4-CR-C 3
4 PP C 4-PP-C 4

What I need is to get pskey of the predecessor for the same exp giving the following relation:

order 1 -> no predecessor

order 2 -> no predecessor

order 3 -> [1,2]

order 4 -> [3]

And add those values to a new column called predecessor

The expected result would be like:

+---+---+---+------+-----+----------------------------------------+
|exp|pid|mat|pskey |order|predecessor                             |
+---+---+---+------+-----+----------------------------------------+
|1  |CR |P  |1-CR-P|1    |null                                    |
|1  |M  |C  |1-M-C |2    |null                                    |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]                        |
|1  |PP |C  |1-PP-C|4    |[1-CR-C]                                |
|3  |M  |C  |3-M-C |2    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |M  |C  |2-M-C |2    |null                                    |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|4  |CR |P  |4-CR-P|1    |null                                    |
|4  |M  |C  |4-M-C |2    |null                                    |
|4  |CR |C  |4-CR-C|3    |[4-CR-P, 4-M-C]                         |
|4  |PP |C  |4-PP-C|4    |[4-CR-C]                                |
+---+---+---+------+-----+----------------------------------------+

I am quite new to pyspark so I have no clue how to manage it.

Upvotes: 1

Views: 110

Answers (1)

Steven
Steven

Reputation: 15258

Differents cases on order are handled with when. You aggregate the values with a collect_set to get unic identifiers:

from pyspark.sql import functions as F, Window 

df2 = df.withColumn(
    "predecessor",
    F.when(
        F.col("order") == 3,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-2, -1)
        ),
    ).when(
        F.col("order") == 4,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-1, -1)
        ),
    ),
)

the result :

df2.show(truncate=False)
+---+---+---+------+-----+----------------+                                     
|exp|pid|mat|pskey |order|predecessor     |
+---+---+---+------+-----+----------------+
|1  |CR |P  |1-CR-P|1    |null            |
|1  |M  |C  |1-M-C |2    |null            |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]|
|1  |PP |C  |1-PP-C|4    |[1-CR-C]        |
|3  |M  |C  |3-M-C |2    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |M  |C  |2-M-C |2    |null            |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|4  |CR |P  |4-CR-P|1    |null            |
|4  |M  |C  |4-M-C |2    |null            |
+---+---+---+------+-----+----------------+
only showing top 20 rows

Upvotes: 2

Related Questions