Reputation: 1
I have a code snippet here from the project I am working on. I don't understand exactly what ROW_NUMBER() does, or what the rownumber outside the parentheses is for.
select az.* , ROW_NUMBER () OVER (PARTITION BY txn_no, seq_no order by txn_no, seq_no)rownumber
I am supposed to convert this code to PySpark and need some help with the conversion.
Upvotes: 0
Views: 732
Reputation: 1
The PySpark equivalent of the Oracle SQL code written above is as follows:
from pyspark.sql import functions as sf
from pyspark.sql.window import Window

t3 = az.select(az["*"], sf.row_number().over(Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no")).alias("rownumber"))
As noted in the other answer, the order by here seems unnecessary, since it repeats the same columns as the partition by, which makes the resulting row numbers within each group effectively arbitrary, but that is what my project head is asking for. Changes to the answer above are highly welcome! Thank you!
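For anyone trying this out, here is a minimal self-contained sketch of the same window on toy data (the real az DataFrame in my project is loaded elsewhere; the values below are just made up for illustration):

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy stand-in for az; only the two window columns matter here
az = spark.createDataFrame(
    [(123, 123), (123, 123), (123, 124), (124, 124), (124, 124)],
    ["txn_no", "seq_no"],
)

w = Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no")
t3 = az.select(az["*"], sf.row_number().over(w).alias("rownumber"))
t3.show()
# Because the ORDER BY repeats the PARTITION BY columns, every row in a group
# ties, so the numbering within each group is effectively arbitrary.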
Upvotes: 0
Reputation: 74605
ROW_NUMBER () OVER (PARTITION BY txn_no, seq_no order by txn_no, seq_no)rownumber
means "break the results into groups where all rows in each group have the same value for txn_no/seq_no
, then number them sequentially increasing in order of txn_no/seq_no
(which doesn't make sense; the person who wrote this might not have known what they were doing) and the finalrownumber
is just an alias name for this generated column
All in all, for a resultset like:
txn_no, seq_no
123, 123
123, 123
123, 124
124, 124
124, 124
It would end up like:
txn_no, seq_no, rownumber
123, 123, 1
123, 123, 2
123, 124, 1
124, 124, 1
124, 124, 2
i.e. the rownumber restarts every time the txn_no/seq_no pairing changes.
As noted, it doesn't make much sense. This is a more valid use of ROW_NUMBER:
person, address, livedthereuntil
john, palm springs, 2019-12-31
john, capitol hill, 2020-12-31
john, sunset strip, 2018-12-31
ROW_NUMBER() OVER(PARTITION BY person ORDER BY livedthereuntil DESC) rn
person, address, livedthereuntil, rn
john, capitol hill, 2020-12-31, 1
john, palm springs, 2019-12-31, 2
john, sunset strip, 2018-12-31, 3
Thus we can say John's most recent address is the row with rn = 1, or answer questions like "where was he living 2 houses ago?" (sunset strip, rn = 3, the house before his previous one).
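Since the question is about PySpark, a rough, hedged sketch of this "most recent address" pattern (adapted from the window API shown in the other answer, with the example data above as a toy DataFrame) would be:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

addresses = spark.createDataFrame(
    [("john", "palm springs", "2019-12-31"),
     ("john", "capitol hill", "2020-12-31"),
     ("john", "sunset strip", "2018-12-31")],
    ["person", "address", "livedthereuntil"],
)

# number each person's addresses from most recent to oldest
w = Window.partitionBy("person").orderBy(sf.col("livedthereuntil").desc())
ranked = addresses.withColumn("rn", sf.row_number().over(w))
ranked.filter(sf.col("rn") == 1).show()   # John's most recent address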
Typically we partition by one thing but order by another before we start numbering, for this reason. Stuff like "get me the 3 most recent transactions for every person who has an account with us" -> over(partition by account_number order by transaction_date desc) as rn ... where rn < 4. It doesn't make sense to order by the same thing you partition by, because logically everything in the partition has the same value, so ordering by it is useless.
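A similarly hedged PySpark sketch of that "3 most recent transactions per account" pattern; the transactions DataFrame and its column names here are purely illustrative assumptions, not anything from the question:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# illustrative toy data; real column names would come from the actual schema
transactions = spark.createDataFrame(
    [("A1", "2021-01-05"), ("A1", "2021-01-06"), ("A1", "2021-01-07"),
     ("A1", "2021-01-08"), ("A2", "2021-01-02")],
    ["account_number", "transaction_date"],
)

w = Window.partitionBy("account_number").orderBy(sf.col("transaction_date").desc())
(transactions
 .withColumn("rn", sf.row_number().over(w))
 .filter(sf.col("rn") < 4)      # keep only the 3 most recent per account
 .drop("rn")
 .show())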
I don't know PySpark, but I hope this helps you replicate whatever that row number was for.
Upvotes: 1