Norah Jones
Norah Jones

Reputation: 467

Optimization of attribution based on timestamps

I am struggling to optimize my query where I need to combine orders data and events data using timestamp so that I attribute certain type of the event to the order in this way that the closest event (table A) will be attributed to order (table B).

Examples: Table A:

Table B:

Currently, I am doing it using cross join but it is not performant at all:

WITH ClosestMatch AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY b.Order ORDER BY UNIX_TIMESTAMP(b.Timestamp) - UNIX_TIMESTAMP(a.Timestamp)) AS Rank
    FROM 
        TableB b
    CROSS JOIN 
        TableA a
    WHERE b.Timestamp <= a.Timestamp

)
SELECT
    Order,
    B_Timestamp,
    A_Timestamp,
    Event
FROM 
    ClosestMatch
WHERE 
    Rank = 1
ORDER BY 
    B_Timestamp;

Do you have any ideas how to improve it?

Upvotes: 0

Views: 39

Answers (2)

jydiw
jydiw

Reputation: 41

drawing inspiration from Guru Stron's answer, you can add the LEAD window function to your event table--assuming events have unique timestamps. then set the join condition so that the timestamps in your order table fall between the timestamps in your events table.

WITH
  table_a ("timestamp", "event") AS (VALUES
    (TIMESTAMP '2024-01-01 10:00', 'Red'),
    (TIMESTAMP '2024-01-01 12:00', 'Blue'),
    (TIMESTAMP '2024-01-01 14:00', 'Pink')
  ),
  table_b ("timestamp", "order") AS (VALUES
    (TIMESTAMP '2024-01-01 11:00', 'Cat'),
    (TIMESTAMP '2024-01-01 11:40', 'Dog'),
    (TIMESTAMP '2024-01-02 09:10', 'Bird')
  ),
  table_a_with_ranges AS (
    SELECT
      *,
      LEAD("timestamp") OVER(ORDER BY "timestamp") AS next_timestamp
    FROM
      table_a
  )
SELECT
  b."order",
  b."timestamp" AS b_timestamp,
  a."timestamp" AS a_timestamp,
  a.event
FROM
  table_b AS b
  JOIN table_a_with_ranges AS a
    ON CASE 
      WHEN a.next_timestamp IS NOT NULL THEN b."timestamp" BETWEEN a."timestamp" AND a.next_timestamp
      ELSE b."timestamp" >= a."timestamp"
        END

Upvotes: 0

Guru Stron
Guru Stron

Reputation: 142833

There is not much you can do I think. There is one dirty hack you can try though. You can "linearize" your two tables into one, order it by the "shared" timestamp and then do some windows functions magic:

-- sample data
WITH table_a(Timestamp, Event) as (
    values (timestamp '2024-01-01 10:00', 'Red'),
    (timestamp '2024-01-01 12:00', 'Blue'),
    (timestamp '2024-01-01 14:00', 'Pink')
),
table_b(Timestamp, "Order") as (
    values (timestamp '2024-01-01 11:00', 'Cat'),
    (timestamp '2024-01-01 11:40', 'Dog'),
    (timestamp '2024-01-02 09:10', 'Bird')
),
        
-- query parts
merged AS (
    select 'event' type,
           null "order",
           table_a.*
    from table_a
    union
    select 'order' type,
            table_b."Order",
           table_b.Timestamp,
           null event
    from table_b
)

select "order",
    timestamp,
    event,
    event_time
from(
    SELECT
        type,
        "order",
        timestamp,
        lead(event) IGNORE NULLS over (order by timestamp) event,
        lead(if(type='event', timestamp)) IGNORE NULLS over (order by timestamp) event_time
    FROM
        merged
)
where type = 'order'

Output (does not match the one from you query, which does not match your desired output as I stated in the comments, so I made a bit of a guess what you need):

order timestamp event event_time
Cat 2024-01-01 11:00:00.000 Blue 2024-01-01 12:00:00.000
Dog 2024-01-01 11:40:00.000 Blue 2024-01-01 12:00:00.000
Bird 2024-01-02 09:10:00.000

You can use lag in the similar way to lead to get the closest event from "other side" (i.e. before) if needed.

Upvotes: 0

Related Questions