zerkms
zerkms

Reputation: 254976

Partitioning function for continuous sequences

There is a table of the following structure:

CREATE TABLE history
(
  pk serial NOT NULL,
  "from" integer NOT NULL,
  "to" integer NOT NULL,
  entity_key text NOT NULL,
  data text NOT NULL,
  CONSTRAINT history_pkey PRIMARY KEY (pk)
);

The pk is a primary key, from and to define a position in the sequence and the sequence itself for a given entity identified by entity_key. So the entity has one sequence of 2 rows in case if the first row has the from = 1; to = 2 and the second one has from = 2; to = 3. So the point here is that the to of the previous row matches the from of the next one.

The order to determine "next"/"previous" row is defined by pk which grows monotonously (since it's a SERIAL).

The sequence does not have to start with 1 and the to - from does not necessary 1 always. So it can be from = 1; to = 10. What matters is that the "next" row in the sequence matches the to exactly.

Sample dataset:

pk  |  from  |  to  |  entity_key  |  data
----+--------+------+--------------+-------
1   |   1    |   2  |      42      |  foo
2   |   2    |   3  |      42      |  bar
3   |   3    |   4  |      42      |  baz
4   |  10    |  11  |      42      |  another foo
5   |  11    |  12  |      42      |  another baz
6   |   1    |   2  |     111      |  one one one
7   |   2    |   3  |     111      |  one one one two
8   |   3    |   4  |     111      |  one one one three

And what I cannot realize is how to partition by "sequences" here so that I could apply window functions to the group that represents a single "sequence".

Let's say I want to use the row_number() function and would like to get the following result:

pk  |  row_number | entity_key
----+-------------+------------
1   |     1       |       42
2   |     2       |       42
3   |     3       |       42
4   |     1       |       42
5   |     2       |       42
6   |     1       |      111
7   |     2       |      111
8   |     3       |      111

For convenience I created an SQLFiddle with initial seed: http://sqlfiddle.com/#!15/e7c1c

PS: It's not the "give me the codez" question, I made my own research and I just out of ideas how to partition.

It's obvious that I need to LEFT JOIN with the next.from = curr.to, but then it's still not clear how to reset the partition on next.from IS NULL.

PS: It will be a 100 points bounty for the most elegant query that provides the requested result

PPS: the desired solution should be an SQL query not pgsql due to some other limitations that are out of scope of this question.

Upvotes: 4

Views: 271

Answers (4)

Teve
Teve

Reputation: 56

Well, this not exactly one SQL query but:

select a.pk as PK, a.entity_key as ENTITY_KEY, b.pk as BPK, 0 as Seq into #tmp 
from history a left join history b on a."to" = b."from" and a.pk = b.pk-1 

declare @seq int

select @seq = 1

update #tmp set Seq =  case when (BPK is null) then @seq-1 else @seq end,  
@seq = case when (BPK is null) then @seq+1 else @seq end

select pk, entity_key, ROW_NUMBER() over (PARTITION  by entity_key, seq order by pk asc) 
from #tmp order by pk

This is in SQL Server 2008

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270181

You can use generate_series() to generate all the rows between the two values. Then you can use the difference of row numbers on that:

select pk, "from", "to",
       row_number() over (partition by entity_key, min(grp) order by pk) as row_number
from (select h.*,
             (row_number() over (partition by entity_key order by ind) -
              ind) as grp
      from (select h.*, generate_series("from", "to" - 1) as ind
            from history h
           ) h
     ) h
 group by pk, "from", "to", entity_key

Because you specify that the difference is between 1 and 10, this might actually not have such bad performance.

Unfortunately, your SQL Fiddle isn't working right now, so I can't test it.

Upvotes: 2

wildplasser
wildplasser

Reputation: 44250

Just for fun & completeness: a recursive solution to reconstruct the (doubly) linked lists of records. [ this will not be the fastest solution ]

NOTE: I commented out the ascending pk condition(s) since they are not needed for the connection logic.

WITH RECURSIVE zzz AS (
        SELECT h0.pk
        , h0."to" AS next
        , h0.entity_key AS ek
        , 1::integer AS rnk
        FROM history h0
        WHERE NOT EXISTS (
                SELECT * FROM history nx
                WHERE nx.entity_key = h0.entity_key
                AND nx."to" = h0."from"
                -- AND nx.pk > h0.pk
                )
        UNION ALL
        SELECT h1.pk
        , h1."to" AS next
        , h1.entity_key AS ek
        , 1+zzz.rnk AS rnk
        FROM zzz
        JOIN history h1
          ON h1.entity_key = zzz.ek
                AND h1."from" = zzz.next
                -- AND h1.pk > zzz.pk
        )
SELECT * FROM zzz
ORDER BY ek,pk
        ;

Upvotes: 2

Steve Kass
Steve Kass

Reputation: 7184

I don’t know if it counts as “elegant,” but I think this will do what you want:

with Lagged as (
  select
    pk,
    case when lag("to",1) over (order by pk) is distinct from "from" then 1 else 0 end as starts,
    entity_key
  from history
), LaggedGroups as (
  select
    pk,
    sum(starts) over (order by pk) as groups,
    entity_key
  from Lagged
)
  select
    pk,
    row_number() over (
      partition by groups
      order by pk
    ) as "row_number",
    entity_key
from LaggedGroups

Upvotes: 7

Related Questions