Iteratively get the max of a data frame column, add one and repeat for all rows in spark/sql

Question

I'm using spark and mainly interacting with it via sql (I'm a novice to both, so please bear with me). I have a data set that looks like this in a table called taxonomies:

level_4         level_5    level_6    level_7       system_id   node_id
American League West       null       null           4          633
American League East       null       null           4          634
National League West       Dodgers    bellinger      3          635
National League Central    Cardinals  null           null       null
American League Central    null       null           null       null

I'm going to get the system_id columns by left joining another table; what I want to do here is get a new node_id for the null values by adding one to the max of the node_id column.

I adapted this query from an answer to a previous question:

WITH cte AS (
    SELECT max(node_id) AS maxid FROM taxonomies
  )
  SELECT cte.maxid + row_number() OVER (ORDER BY node_id) AS node_id, system_id, `level_4`, `level_5`, `level_6`, `level_7`
  FROM taxonomies
    LEFT JOIN cte WHERE taxonomies.node_id IS null
  UNION
  SELECT * FROM taxonomies WHERE node_id IS NOT null

What I don't understand is that this appropriately makes new node_id for the null values, but for everything else it swaps system_id and node_id. I've tried switching node_id and system_id after AS, but all this does is change the column names with the same results. What I want is this:

level_4         level_5    level_6    level_7       system_id   node_id
American League West       null       null           4          633
American League East       null       null           4          634
National League West       Dodgers    bellinger      3          635
National League Central    Cardinals  null           null       636
American League Central    null       null           null       637

Any help is greatly appreciated!

Deepak K · Accepted Answer

A solution to your problem can be this:

SELECT 
        system_id, 
        (min(node_id) over ())-1+row_number() OVER (ORDER BY -node_id desc) node_id, 
        `level_4`, `level_5`, `level_6`, `level_7` 
    FROM 
        taxonomies

If you want to do this using join, this is the way:

SELECT 
    system_id, 
    t2.min_node-1+row_number() OVER (ORDER BY -tn.node_id desc) node_id, 
    `level_4`, `level_5`, `level_6`, `level_7` 
FROM 
    taxonomies tn 
cross join 
    (SELECT min(node_id) as min_node FROM taxonomies) as t2

If you want to do this using sub-query, this is the way:

SELECT 
        system_id, 
        (select min(node_id) from taxonomies)-1+row_number() OVER (ORDER BY -node_id desc) node_id, 
        `level_4`, `level_5`, `level_6`, `level_7` 
    FROM 
        taxonomies

Iteratively get the max of a data frame column, add one and repeat for all rows in spark/sql

Answers (1)

Related Questions