CuriousDeveloper
CuriousDeveloper

Reputation: 899

DENSE_RANK breaking sequentially when changed

The grp column should be represent a block partitioned by GroupId, RouteId and ordered LengthStart where as you travel through the blocks if the locationId remains the same the grp is the same. A break in the sequence creates a new grp

+---------+----------+--------------+------------+-------------+------+
| GROUPID |  ROUTEID |  LENGTHSTART |  LENGTHEND |  LOCATIONID |  GRP |
+---------+----------+--------------+------------+-------------+------+
|     1   | A        | 0            | 1          | 1           | 1    |
|     1   | A        | 1            | 2          | 1           | 1    |
|     1   | A        | 2            | 3          | 2           | 2    |
|     1   | A        | 3            | 4          | 1           | 3    |
|     2   | A        | 2            | 3          | 2           | 4    |
|     1   | B        | 2            | 3          | 2           | 5    |
|     1   | A        | 4            | 5          | 1           | 3    |
+---------+----------+--------------+------------+-------------+------+

My search on this problem led me to this solution: DENSE_RANK according to particular order

My attempt at making the grp calculation:

SELECT *, ROW_NUMBER() OVER (ORDER BY GroupId, RouteId, LengthStart) - ROW_NUMBER() OVER (PARTITION BY GroupId, RouteId, LocationId ORDER BY GroupId, RouteId, LengthStart) AS grp
FROM mytable

I tried to adopt that solution such that I could have more levels of partitioning and it works in really basic use-cases (like the one showcased above but in complicated scenarios its failing).

I don't really understand completely why two row_number() are subtracting and how it works out but it worked really well in the simple example.

I tried other approaches uses LAG but just fail to understand how to take the logic and apply it.

Here is a fiddle with the more complicated scenerio:

https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=3704dfe8583b0dd020b189184d149cb7

You can see one of the many mistakes I've been seeing highlighted here: enter image description here

Upvotes: 1

Views: 638

Answers (2)

Paul Maxwell
Paul Maxwell

Reputation: 35603

By altering the parameters of the over clauses will produce different result. Note you don't need to repeat columns used for partitioning in the order by. The concept of this technique is to calculate a common value (in column grp) that hat is often used in subsequent calculations such as; number of steps, or cost, or min/max etc.

SELECT
    *
  , ROW_NUMBER() OVER (ORDER BY GroupId, RouteId, LengthStart) 
  - ROW_NUMBER() OVER (PARTITION BY GroupId, RouteId ORDER BY LengthStart ) AS grp
FROM mytable
ORDER BY
    GroupId
  , RouteId
  , LengthStart

It is useful to output the two row_number calculation so that you can see how this works:

    Id      GroupId   RouteId   LengthStart   LengthEnd   LocationId   rn1   rn2   grp  
 --------- --------- --------- ------------- ----------- ------------ ----- ----- ----- 
  2651246         3   AAA         0.0000000   0.0920000      1884268     1     1     0  
  2651247         3   AAA         0.0920000   0.5800000      1855305     2     2     0  
  2651248         3   AAA         0.5800000   1.3610000      1884268     3     3     0  
  2651249         3   AAA         1.3610000   1.6170000      1884268     4     4     0  
  2651250         3   AAA         1.6170000   2.3750000      1884268     5     5     0  
  2681493         3   BBB         0.0000000   1.5600000      1864963     6     1     5  
  2681494         3   BBB         1.5600000   2.7100000      1864963     7     2     5  
  2681495         3   BBB         2.7100000   3.3900000      1864963     8     3     5  
  2954915         3   CCC         0.0000000   0.0500000      1883382     9     1     8  
  2954916         3   CCC         0.0500000   0.1400000      1846300    10     2     8  
  …                                                                                     

The rn1 column starts at 1 and keeps incrementing. The second rn column restarts at 1 with each partition, but as this also increments by one on the same order sequence, when you deduct rn2 from rn1 you get a "constant" result for grp for each partition used in rn2

Upvotes: 0

MatBailie
MatBailie

Reputation: 86765

This looks to be a gaps-and-islands problem where you have mis-understood the solution.

In this case the grp column doesn't identify a gap or island on it's own, it just gives you the extra information you need to identify them.

In your case, the way you've implemented it (locationId, grp) forms the new "sub_group" identifier.

I'd also change your query slightly to make it easy to understand what's happening...

SELECT
  *,
    ROW_NUMBER() OVER (PARTITION BY GroupID, RouteID             ORDER BY LengthStart)
  - ROW_NUMBER() OVER (PARTITION BY GroupId, RouteId, LocationId ORDER BY LengthStart) AS grp
FROM
  mytable
ORDER BY
  GroupId, RouteId, LengthStart

That makes it more explicit that different GroupID, RouteID have nothing to do with each other when working out the gaps and islands (sub-groups).

Then I can show a simplified example...

  LocationID, Start, SetRowNum, LocRowNum, Difference, GroupID

      1        000       1       1          0          (1,0)
      1        100       2       2          0          (1,0)

      2        200       3         1          2        (2,2)
      2        300       4         2          2        (2,2)

      1        400       5       3          2          (1,2)

      2        500       6         3          3        (2,3)

      3        600       7           1          6      (3,6)

      2        700       8         4          4        (2,4)
      2        800       9         5          4        (2,4)

An island is worked out for each LocationID separately.

A gap is just any row for any other LocationID.

The "trick" is that within each island both the rownumbers are increasing together. By both increasing together, the difference remains the same. That difference is the island_id.

Then, during a gap, the first rownumber is increasing, such that when we reach the next island the difference between the two rownums have increased, giving a new island_id **for that LocationID.


Remember, when working out the island_id for location_1 we treat all other locations as gaps between location 1's islands.

For Location 1, we have islands at rows 1,2 and 5, with IslandID of 0 and 2 respectively.

  LocationID, Start, SetRowNum, LocRowNum, Difference, GROUP_ID

      1        000       1       1          0          (1,0)
      1        100       2       2          0          (1,0)

      GAP

      1        400       5       3          2          (1,2)

For Location 2, we have islands at rows 3,4, 6 and 8,9 with IslandID of 2, 3, 4 respectively.

  LocationID, Start, SetRowNum, LocRowNum, Difference

     GAP

      2        200       3         1          2        (2,2)
      2        300       4         2          2        (2,2)

     GAP

      2        500       6         3          3        (2,3)

     GAP

      2        700       8         4          4        (2,4)
      2        800       9         5          4        (2,4)

For Location 3, we have islands at row 7 with IslandID of 6.

  LocationID, Start, SetRowNum, LocRowNum, Difference

     GAP

      3        600       7           1          6      (3,6)

     GAP

All in all, the islands all have different ID's.

But Location1 and Location2 both have islands with IslandID = 2.

  • So, you need to use (locationID, IslandID) together
  • The composite key will be unique for that partition

Upvotes: 1

Related Questions