Vedant Athavale

Reputation: 1

How to convert PARTITION BY and ORDER BY with ROW_NUMBER to PySpark?

I have a code snippet here from the project I am working on. I don't understand exactly what ROW_NUMBER() does here, or what the rownumber outside the parentheses means.

select az.*, ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber

I am supposed to convert this code to PySpark and need some help with the conversion.

Upvotes: 0

Views: 732

Answers (2)

Vedant Athavale

Reputation: 1

The PySpark equivalent of the Oracle SQL code above is as follows:

from pyspark.sql import functions as sf
from pyspark.sql.window import Window

t3 = az.select(az["*"], sf.row_number().over(Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no")).alias("rownumber"))

Now, as said above, the ORDER BY here seems redundant since it repeats the partitioning columns, which makes the row numbering within each partition effectively arbitrary (it can change from run to run), but that is what my project head is asking for. Changes to the answer above are highly welcome! Thank you!
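For anyone who wants to verify this end to end, here is a minimal self-contained sketch; the SparkSession setup and the toy data are assumptions for illustration only:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data shaped like the txn_no/seq_no example in the question
az = spark.createDataFrame(
    [(123, 123), (123, 123), (123, 124), (124, 124), (124, 124)],
    ["txn_no", "seq_no"],
)

w = Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no")
t3 = az.select(az["*"], sf.row_number().over(w).alias("rownumber"))
t3.show()

# Because the ORDER BY columns equal the PARTITION BY columns, all rows
# within a partition tie, so which tied row gets 1 vs 2 is not deterministic.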

Upvotes: 0

Caius Jard

Reputation: 74605

ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber means "break the results into groups where all rows in each group have the same value for txn_no/seq_no, then number them sequentially, increasing in order of txn_no/seq_no" (which doesn't make sense; the person who wrote this might not have known what they were doing). The final rownumber is just an alias name for this generated column.

All in, for a resultset like:

txn_no, seq_no
123,    123
123,    123
123,    124
124,    124
124,    124
124,    124
124,    124

It would end up like:

txn_no, seq_no, rownumber
123,    123,    1
123,    123,    2
123,    124,    1
124,    124,    1
124,    124,    2
124,    124,    3
124,    124,    4

i.e. the rownumber restarts every time the txn_no/seq_no pairing changes

As noted, it doesn't make much sense. This is a more valid use of ROW_NUMBER:

person, address,      livedthereuntil
john,   palm springs, 2019-12-31
john,   capitol hill, 2020-12-31
john,   sunset strip, 2018-12-31

ROW_NUMBER() OVER(PARTITION BY person ORDER BY livedthereuntil DESC) rn

person, address,      livedthereuntil, rn
john,   capitol hill, 2020-12-31,      1
john,   palm springs, 2019-12-31,      2
john,   sunset strip, 2018-12-31,      3

Thus we can say John's most recent address is the rn = 1 row, or answer things like "where was he living 2 houses ago?" with the rn = 3 row (sunset strip, where he lived before his previous house).

Typically we partition by one thing but order by another before we start numbering, for exactly this reason. Think of requirements like "get me the 3 most recent transactions for every person who has an account with us" -> over(partition by account_number order by transaction_date desc) as rn ... where rn < 4. It doesn't make sense to order by the same thing you partition by, because logically everything in the partition has the same value, so ordering by it is useless.
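In PySpark (the question's target), that pattern might look roughly like the sketch below; the transactions data and the account_number / transaction_date column names are hypothetical stand-ins, not anything from the question:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical transactions data: account_number, transaction_date
transactions = spark.createDataFrame(
    [("A1", "2021-03-01"), ("A1", "2021-01-15"), ("A1", "2020-11-02"),
     ("A1", "2020-05-20"), ("B2", "2021-02-10")],
    ["account_number", "transaction_date"],
)

# Partition by one thing (account), order by another (date) before numbering
w = Window.partitionBy("account_number").orderBy(sf.col("transaction_date").desc())

recent = transactions.withColumn("rn", sf.row_number().over(w)).where("rn < 4")
recent.show()  # the 3 most recent transactions per account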

I don't know PySpark, but I hope this helps you replicate whatever that row number was for.

Upvotes: 1
