Reputation: 149
I have a table with about 700 million rows, which have the below sample for only one line_id.
LINE_ID|COLLECTION_DATE |DSL_CARD_TYPE|
-------|-------------------|-------------|
1234567|2020-03-25 08:46:08|ADSL_PORT |
1234567|2020-03-26 08:31:48|ADSL_PORT |
1234567|2020-03-27 08:42:40|VDSL_PORT |
1234567|2020-03-28 08:36:32|VDSL_PORT |
1234567|2020-03-29 08:31:33|VDSL_PORT |
1234567|2020-03-30 08:50:15|VDSL_PORT |
1234567|2020-04-31 08:44:33|ADSL_PORT |
1234567|2020-03-01 08:34:53|ADSL_PORT |
1234567|2020-04-02 08:44:11|ADSL_PORT |
1234567|2020-04-03 08:43:51|VDSL_PORT |
1234567|2020-04-04 08:54:33|ADSL_PORT |
1234567|2020-04-05 09:06:47|ADSL_PORT |
1234567|2020-04-06 09:06:57|VDSL_PORT |
1234567|2020-04-07 09:13:32|VDSL_PORT |
What I need is to group DSL_CARD_TYPE
and create a new column called Next_COLLECTION_DATE
to get the next DSL_CARD_TYPE like below
LINE_ID|COLLECTION_DATE |Next_COLLECTION_DATE |DSL_CARD_TYPE|
-------|-------------------|----------------------|-------------|
1234567|2020-03-25 08:46:08|2020-03-26 08:31:48 |ADSL_PORT |
1234567|2020-03-27 08:42:40|2020-03-30 08:50:15 |VDSL_PORT |
1234567|2020-03-31 08:34:53|2020-04-02 08:44:11 |ADSL_PORT |
1234567|2020-04-03 08:43:51|2020-04-03 08:43:51 |VDSL_PORT |
1234567|2020-04-04 08:54:33|2020-04-05 09:06:47 |ADSL_PORT |
1234567|2020-04-06 09:06:57|2020-04-07 09:13:32 |VDSL_PORT |
I have created a very dummy and complex query to do the job, but with this huge amount of data it takes hours
COALESCE (lead (COLLECTION_DATE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE),NOW() )Next_Collection_Date,
DSL_CARD_TYPE
FROM (
SELECT * FROM (
SELECT
LINE_ID, COLLECTION_DATE,
DSL_CARD_TYPE ,
lead (DSL_CARD_TYPE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE) To_Sync_Port,
lag (DSL_CARD_TYPE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE) B_Sync_Port
FROM
ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL
WHERE SYNC_PORT <> TO_SYNC_PORT OR B_Sync_Port IS NULL )abc2```
Upvotes: 1
Views: 1604
Reputation: 1270463
This looks like a gaps-and-islands problem, which in this case is probably best solved using the difference of row numbers:
select line_id, dsl_card_type, min(collection_date), max(collection_date)
from (select v.*,
row_number() over (partition by line_id order by collection_date) as seqnum,
row_number() over (partition by line_id, dsl_card_type order by collection_date) as seqnum_2
from ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL v
where collection_date >= '2020-07-27 00:00:00'
) v
group by line_id, dsl_card_type, (seqnum - seqnum_2);
It is a little tricky to explain how this works. If you run the subquery, you can see how the difference between the two row numbers defines the adjacent rows with the same card type.
Upvotes: 1