Reputation: 23
this is my goal, split the dates into groups of 24 hours, but those depends of the data, not of a very specific date
Let's say these are the calls from a call center, I want to know how many sessions I had, but those sessions are valid for 24 hours, those 24 hours are starting to count since the first date_sent, if the next call is later of those first 24 hours, a new session would be created
The expected results are like this: First 3 columns are those I already have in the table, fourth one is the required calculation
identifier customer_id date_sent StartOfSession sessionId
456456150 5366 2020-09-01T10:17:48.360000 2020-09-01T10:17:48.360000 1
456456150 5366 2020-09-01T18:24:45.552000 2020-09-01T10:17:48.360000 1
456456150 5366 2020-09-02T10:20:46.283000 2020-09-02T10:20:46.283000 2
456456150 5366 2020-09-02T18:25:01.911000 2020-09-02T10:20:46.283000 2
456456150 5366 2020-09-03T10:20:38.407000 2020-09-02T10:20:46.283000 2
456456150 5366 2020-09-03T18:23:35.915000 2020-09-03T18:23:35.915000 3
456456150 5366 2020-09-04T10:19:46.474000 2020-09-03T18:23:35.915000 3
456456150 5366 2020-09-04T14:22:17.236000 2020-09-03T18:23:35.915000 3
456456150 5366 2020-09-04T18:24:33.155000 2020-09-04T18:24:33.155000 4
456456150 5366 2020-09-05T10:19:48.871000 2020-09-04T18:24:33.155000 4
456456150 5366 2020-09-05T18:25:07.968000 2020-09-05T18:25:07.968000 5
456456150 5366 2020-09-06T10:19:34.808000 2020-09-05T18:25:07.968000 5
456456150 5366 2020-09-06T18:26:17.705000 2020-09-06T18:26:17.705000 6
456456150 5366 2020-09-07T10:21:28.585000 2020-09-06T18:26:17.705000 6
456456150 5366 2020-09-07T18:24:17.123000 2020-09-06T18:26:17.705000 6
456456150 5366 2020-09-08T10:20:09.850000 2020-09-08T10:20:09.850000 7
456456150 5366 2020-09-08T18:24:32.733000 2020-09-08T10:20:09.850000 7
456456150 5366 2020-09-09T10:20:05.336000 2020-09-08T10:20:09.850000 7
456456150 5366 2020-09-09T12:12:41.137000 2020-09-09T12:12:41.137000 8
456456150 5366 2020-09-09T18:24:25.783000 2020-09-09T12:12:41.137000 8
I've tried using window functions, but I cannot achieve the same expected results:
SELECT identifier, customer_id, date_sent,
FIRST_VALUE(date_sent) OVER (PARTITION BY A.identifier, A.customer_id, CAST(A.date_sent AS DATE) ORDER BY UNIX_SECONDS(TIMESTAMP(date_sent)) RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) FirstV_date1
FROM `sandbox.testing` A
WHERE identifier = '456456150'
AND date_sent between '2020-09-01' AND '2020-09-10'
Those would be my actual results
identifier customer_id date FirstV_date1
456456150 5366 2020-09-01T10:17:48.360000 2020-09-01T10:17:48.360000
456456150 5366 2020-09-01T18:24:45.552000 2020-09-01T10:17:48.360000
456456150 5366 2020-09-02T10:20:46.283000 2020-09-02T10:20:46.283000
456456150 5366 2020-09-02T18:25:01.911000 2020-09-02T10:20:46.283000
456456150 5366 2020-09-03T10:20:38.407000 2020-09-03T10:20:38.407000
456456150 5366 2020-09-03T18:23:35.915000 2020-09-03T10:20:38.407000
456456150 5366 2020-09-04T10:19:46.474000 2020-09-04T10:19:46.474000
456456150 5366 2020-09-04T14:22:17.236000 2020-09-04T10:19:46.474000
456456150 5366 2020-09-04T18:24:33.155000 2020-09-04T10:19:46.474000
456456150 5366 2020-09-05T10:19:48.871000 2020-09-05T10:19:48.871000
456456150 5366 2020-09-05T18:25:07.968000 2020-09-05T10:19:48.871000
456456150 5366 2020-09-06T10:19:34.808000 2020-09-06T10:19:34.808000
456456150 5366 2020-09-06T18:26:17.705000 2020-09-06T10:19:34.808000
456456150 5366 2020-09-07T10:21:28.585000 2020-09-07T10:21:28.585000
456456150 5366 2020-09-07T18:24:17.123000 2020-09-07T10:21:28.585000
456456150 5366 2020-09-08T10:20:09.850000 2020-09-08T10:20:09.850000
456456150 5366 2020-09-08T18:24:32.733000 2020-09-08T10:20:09.850000
456456150 5366 2020-09-09T10:20:05.336000 2020-09-09T10:20:05.336000
456456150 5366 2020-09-09T12:12:41.137000 2020-09-09T10:20:05.336000
456456150 5366 2020-09-09T18:24:25.783000 2020-09-09T10:20:05.336000
I've also tried using self join, but I rather not because that's very expensive, but, any ideas would be welcome.
Thanks in advance!
Upvotes: 2
Views: 277
Reputation: 172993
Below is for BigQuery Standard SQL (assumes that date_sent column is of timestamp data type - as it looks like in provided example)
#standardSQL
create temp function get_sessions(x array<timestamp>)
returns array<struct<date_sent timestamp, sessionStart timestamp, session string>>
language js as """
output = []; session = 1; sessionStart = x[0]; total_dur = 0;
a = {}; a.date_sent = x[0]; a.session = session; a.sessionStart = sessionStart;
output.push(a);
for(i = 1; i < x.length; i++){
a = {};
total_dur += x[i].getTime() - x[i-1].getTime();
if(total_dur>24*3600*1000){
total_dur = 0; session++; sessionStart = x[i];
};
a.date_sent = x[i-1]; a.sessionStart = sessionStart; a.session = session;
output.push(a);
}
return output;
""";
select identifier, customer_id, date_sent, sessionStart, session
from (
select identifier, customer_id, get_sessions(array_agg(date_sent order by date_sent)) sessions
from `project.dataset.table`
group by identifier, customer_id
), unnest(sessions)
if to apply to sample data in the question - output is
Important: assumption here - based on your comments - volume of rows per partition (identifier, customer_id) is relatively small (~2K) so js udf memory limit is not a problem here)
Upvotes: 1