Reputation: 23
I have a table with playid, userid, created, and stopped. I want to merge certain rows by keeping the earliest created and the latest stopped within 120 minutes from the earliest created.
See the following table:
playid | userid | created | stopped |
---|---|---|---|
1 | a01 | 2021-01-30 02:29:58.559858 | 2021-01-30 02:40:56.558436 |
2 | a01 | 2021-01-30 02:41:24.023358 | 2021-01-30 02:55:24.112713 |
3 | a01 | 2021-01-30 02:57:30.178579 | 2021-01-30 03:11:14.866678 |
4 | a01 | 2021-01-30 03:11:41.098424 | 2021-01-30 03:22:50.155918 |
5 | a01 | 2021-01-30 03:23:20.545288 | 2021-01-30 03:36:37.027486 |
6 | a01 | 2021-01-30 03:46:10.237971 | 2021-01-30 03:59:17.526151 |
7 | a01 | 2021-01-30 03:59:57.020326 | 2021-01-30 04:16:25.948311 |
8 | a01 | 2021-01-30 04:26:28.968568 | 2021-01-30 04:30:11.863079 |
9 | a01 | 2021-01-30 04:30:14.43786 | 2021-01-30 04:41:42.231915 |
10 | a01 | 2021-01-30 04:43:17.669945 | 2021-01-30 04:45:10.443101 |
11 | a01 | 2021-01-30 04:45:14.50346 | 2021-01-30 04:47:07.082114 |
12 | a01 | 2021-01-30 04:47:10.334574 | 2021-01-30 04:49:30.96017 |
13 | a01 | 2021-01-30 04:49:34.146011 | 2021-01-30 04:50:54.45988 |
14 | a01 | 2021-01-30 04:50:57.948305 | 2021-01-30 05:11:01.246284 |
15 | a01 | 2021-01-30 05:39:29.387396 | 2021-01-30 05:41:39.508654 |
16 | a01 | 2021-01-30 05:41:44.524951 | 2021-01-30 05:43:38.231266 |
17 | a01 | 2021-01-30 05:43:40.785809 | 2021-01-30 05:54:40.711381 |
18 | a01 | 2021-01-30 05:55:10.851725 | 2021-01-30 05:58:24.262351 |
19 | a01 | 2021-01-30 05:58:29.43821 | 2021-01-30 06:00:50.870644 |
20 | a01 | 2021-01-30 06:00:54.168696 | 2021-01-30 06:12:37.210234 |
21 | a01 | 2021-01-30 14:41:01.529666 | 2021-01-30 14:53:01.349479 |
22 | a01 | 2021-01-30 14:53:48.053136 | 2021-01-30 15:05:09.532786 |
23 | a01 | 2021-01-31 15:11:08.547111 | 2021-01-31 15:11:37.564802 |
24 | a01 | 2021-01-31 15:11:40.248499 | 2021-01-31 15:13:40.21787 |
25 | a01 | 2021-01-31 15:13:59.373145 | 2021-01-31 15:31:54.099898 |
26 | a01 | 2021-01-31 15:32:23.20448 | 2021-01-31 15:46:33.993751 |
27 | a01 | 2021-01-31 16:55:19.141051 | 2021-01-31 17:03:43.464444 |
I want the query to give this table (see below). Please note that the next created is after the previous stopped, meaning the created with playid "8" is after stopped with playid "1":
playid | userid | created | stopped |
---|---|---|---|
1 | a01 | 2021-01-30 02:29:58.559858 | 2021-01-30 04:16:25.948311 |
8 | a01 | 2021-01-30 04:26:28.968568 | 2021-01-30 06:12:37.210234 |
21 | a01 | 2021-01-30 14:41:01.529666 | 2021-01-30 15:05:09.532786 |
23 | a01 | 2021-01-31 15:11:08.547111 | 2021-01-31 17:03:43.464444 |
Here is the query I have tried:
SELECT
MIN(t1.playid) as playid,
t1.userid,
SAFE_CAST(t1.created AS DATETIME) as created,
MAX(SAFE_CAST(t2.stopped AS DATETIME)) as stopped
FROM
MyTable as t1
LEFT OUTER JOIN
MyTable as t2
ON t1.userid = t2.userid
WHERE
DATETIME_DIFF(SAFE_CAST(t2.stopped AS DATETIME),SAFE_CAST(t1.created AS DATETIME),MINUTE)
< 120
GROUP BY
t1.playid, t1.userid, t1.created
and querying the previous table using this query:
SELECT
MIN(playid) AS playid,
userid,
MIN(created) AS created,
stopped,
FROM
MyTable2
GROUP BY
userid
stopped
This is the (wrong) table I get when I run the first query:
playid | userid | created | stopped |
---|---|---|---|
1 | a01 | 2021-01-30 02:29:58.559858 | 2021-01-30 04:16:25.948311 |
2 | a01 | 2021-01-30 02:41:24.023358 | 2021-01-30 04:30:11.863079 |
3 | a01 | 2021-01-30 02:57:30.178579 | 2021-01-30 04:50:54.45988 |
4 | a01 | 2021-01-30 03:11:41.098424 | 2021-01-30 05:11:01.246284 |
5 | a01 | 2021-01-30 03:23:20.545288 | 2021-01-30 05:11:01.246284 |
6 | a01 | 2021-01-30 03:46:10.237971 | 2021-01-30 05:43:38.231266 |
7 | a01 | 2021-01-30 03:59:57.020326 | 2021-01-30 05:58:24.262351 |
8 | a01 | 2021-01-30 04:26:28.968568 | 2021-01-30 06:12:37.210234 |
9 | a01 | 2021-01-30 04:30:14.43786 | 2021-01-30 06:12:37.210234 |
10 | a01 | 2021-01-30 04:43:17.669945 | 2021-01-30 06:12:37.210234 |
11 | a01 | 2021-01-30 04:45:14.50346 | 2021-01-30 06:12:37.210234 |
12 | a01 | 2021-01-30 04:47:10.334574 | 2021-01-30 06:12:37.210234 |
13 | a01 | 2021-01-30 04:49:34.146011 | 2021-01-30 06:12:37.210234 |
14 | a01 | 2021-01-30 04:50:57.948305 | 2021-01-30 06:12:37.210234 |
15 | a01 | 2021-01-30 05:39:29.387396 | 2021-01-30 06:12:37.210234 |
16 | a01 | 2021-01-30 05:41:44.524951 | 2021-01-30 06:12:37.210234 |
17 | a01 | 2021-01-30 05:43:40.785809 | 2021-01-30 06:12:37.210234 |
18 | a01 | 2021-01-30 05:55:10.851725 | 2021-01-30 06:12:37.210234 |
19 | a01 | 2021-01-30 05:58:29.43821 | 2021-01-30 06:12:37.210234 |
20 | a01 | 2021-01-30 06:00:54.168696 | 2021-01-30 06:12:37.210234 |
21 | a01 | 2021-01-30 14:41:01.529666 | 2021-01-30 15:05:09.532786 |
22 | a01 | 2021-01-30 14:53:48.053136 | 2021-01-30 15:05:09.532786 |
23 | a01 | 2021-01-31 15:11:08.547111 | 2021-01-31 17:03:43.464444 |
24 | a01 | 2021-01-31 15:11:40.248499 | 2021-01-31 17:03:43.464444 |
25 | a01 | 2021-01-31 15:13:59.373145 | 2021-01-31 17:03:43.464444 |
26 | a01 | 2021-01-31 15:32:23.20448 | 2021-01-31 17:03:43.464444 |
27 | a01 | 2021-01-31 16:55:19.141051 | 2021-01-31 17:03:43.464444 |
The problem here is that the column created in the next row does not jump to the next playid with a created value which is larger/later than the stopped value of the current playid. Instead it just iterates through the table. Please see the tables above and use the playid column to make it easier to understand.
PS: I am very new to SQL. Please help me optimize this if this is very ineffective. Would be best to only have one query, of course.
Thank you!
Upvotes: 2
Views: 118
Reputation: 10232
This needs loops/recursion which is not supported natively in BigQuery SQL. Try user-defined functions, but they are not very scalable:
CREATE TEMP FUNCTION special_merge(x ARRAY<STRUCT<playid INT64, created_at TIMESTAMP, stopped TIMESTAMP>>)
RETURNS ARRAY<STRUCT<playid INT64, created_at TIMESTAMP, stopped TIMESTAMP>>
LANGUAGE js
AS """
var intervalStart = x[0].created_at;
var intervalId = x[0].playid;
var intervalEnd;
var result = [];
for (var row of x)
{
if (row.stopped - intervalStart > 120*60*1000 && intervalEnd) {
result.push({playid: intervalId, created_at:intervalStart, stopped:intervalEnd});
intervalStart = row.created_at;
intervalId = row.playid
}
intervalEnd = row.stopped;
}
result.push({playid: intervalId, created_at:intervalStart, stopped:intervalEnd});
return result;
""";
with mytable as (
select 1 as playid, 'a01' as userid, timestamp '2021-01-30 02:29:58.559858' as created_at, timestamp '2021-01-30 02:40:56.558436' as stopped union all
select 2, 'a01', timestamp '2021-01-30 02:41:24.023358', timestamp '2021-01-30 02:55:24.112713' union all
select 7, 'a01', timestamp '2021-01-30 03:59:57.020326', timestamp '2021-01-30 04:16:25.948311' union all
select 8, 'a01', timestamp '2021-01-30 04:26:28.968568', timestamp '2021-01-30 04:30:11.863079' union all
select 20, 'a01', timestamp '2021-01-30 06:00:54.168696', timestamp '2021-01-30 06:12:37.210234' union all
select 21, 'a01', timestamp '2021-01-30 14:41:01.529666', timestamp '2021-01-30 14:53:01.349479' union all
select 22, 'a01', timestamp '2021-01-30 14:53:48.053136', timestamp '2021-01-30 15:05:09.532786' union all
select 23, 'a01', timestamp '2021-01-31 15:11:08.547111', timestamp '2021-01-31 15:11:37.564802' union all
select 27, 'a01', timestamp '2021-01-31 16:55:19.141051', timestamp '2021-01-31 17:03:43.464444'
)
select userid, m.*
from (
select userid, special_merge(array_agg(STRUCT(playid, created_at, stopped) order by created_at)) as merged
from mytable
group by userid
), unnest(merged) as m
order by userid, m.playid
Upvotes: 1