user2682459
user2682459

Reputation: 1029

Select first row each time column changes

I have the following table in a postgres database, with 3 columns:

ReaderId: String
TagId: String
Timestamp: Timestamp

ReaderId  TagId  Timestamp
A         T1      20190101-00:00:00  *  ~      
A         T1      20190101-00:00:00     ~
A         T1      20190101-00:00:01    
A         T1      20190101-00:00:02   
B         T1      20190101-00:00:03  *
B         T1      20190101-00:00:03 
B         T1      20190101-00:00:04   
A         T1      20190101-00:00:05  * 
A         T1      20190101-00:00:06 
A         T1      20190101-00:00:07   
C         T1      20190101-00:00:08  *
C         T1      20190101-00:00:09   
B         T2      20190101-00:00:01  *
B         T2      20190101-00:00:04 
B         T2      20190101-00:00:05   
C         T2      20190101-00:00:06  *
C         T2      20190101-00:00:07   
B         T2      20190101-00:00:07  *   ~
B         T2      20190101-00:00:07      ~
B         T2      20190101-00:00:08   

I would like a query/function that, when supplied with a TagId, returns the first row each time that tag is read at a different reader to which it was last read at (or the first row if the tag has never been read before). The rows which are eligible to be returned according to this criteria are highlighted by a * above. If there are multiple rows which are identical and 'equally first' then only one of these should be returned (as in the case of the rows marked with a ~ above).

This function needs to be performant as the volume of data is expected to easily grow into the many millions/low billions of rows. I can create any indexes required.

My SQL is rusty, and was never very good in the first place, so any help here is appreciated!

Upvotes: 0

Views: 876

Answers (3)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656331

Use lag() like others already suggested. But you specified:

when supplied with a TagId

So you can simplify. A bit faster, too:

SELECT reader_id, tag_id, ts
FROM  (
   SELECT *, lag(reader_id) OVER (ORDER BY ts) IS DISTINCT FROM reader_id AS pick
   FROM   tbl
   WHERE  tag_id = 'T1'  --  your tag_id here
   ) sub
WHERE  pick;

db<>fiddle here

Also works for NULL values in the column reader_id.

You can wrap this in an SQL function or prepared statement and only pass your tag_id.

Upvotes: 2

klin
klin

Reputation: 121514

Use the window function lag():

select 
    reader_id, tag_id, timestamp
from (
    select
        reader_id, tag_id, timestamp,
        lag(reader_id) over (partition by tag_id order by timestamp)
    from my_table
    ) s
where lag is distinct from reader_id
order by tag_id, timestamp

Window functions are expensive but alternative solutions (if exist) rather cannot be cheaper. The index on (tag_id, timestamp) will support the query.

Online demo on db<>fiddle.

Read also about window functions in the documentation.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269503

Simply use lag():

select t.*
from (select t.*,
             lag(ReaderId) over (partition by TagId order by Timestamp) as prev_ReaderId
      from t
     ) t
where prev_ReaderId is null or prev_ReaderId <> ReaderId;

In Postgres, you can shorten the where clause to:

where prev_ReaderId is distinct from ReaderId

Upvotes: 1

Related Questions