Reputation: 679
I am trying to use the PARTITION BY
clause with the ARRAY_AGG()
function to collapse a column into an array.
My Standard SQL in Big Query is as follows:
WITH initial_30days
AS (
SELECT
date,
fullvisitorId AS user_id,
visitNumber,
CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
FROM
`my-data.XXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103'
GROUP BY 1,2,3,4
)
SELECT
date,
ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING
AND CURRENT ROW) AS agg_array
FROM(
SELECT
date,
user_id,
COUNT(DISTINCT( session_id)) AS sessions
FROM initial_30days
GROUP BY date,user_id)
GROUP BY date,sessions
My expected output is:
+----------+--------------------------+
| date | agg_array |
+----------+--------------------------+
| 20181004 | [34,21,34,21,6,7,4,43] |
| 20181005 | [1,5,56,76,23,1,3,54,45] |
| 20181006 | [22,67,43,1,2,67,3,24] |
| 20181007 | [34,21,34,21,6,7,4,43] |
+----------+--------------------------+
My current output looks something like this taking one date value for example:
+----------+------------------------+
| date | agg_array |
+----------+------------------------+
| 20181004 | [34] |
| 20181004 | [34,21] |
| 20181004 | [34,21,34] |
| 20181004 | [34,21,34,21] |
| 20181004 | [34,21,34,21,6] |
| 20181004 | [34,21,34,21,6,7] |
| 20181004 | [34,21,34,21,6,7,4] |
| 20181004 | [34,21,34,21,6,7,4,43] |
+----------+------------------------+
You can see the array partitioned by date creates an incremental row for each value of that array.
The dataset that the ARRAY_AGG()
function is applied over looks like:
+----------+------------------+----------+
| date | user_id | sessions |
+----------+------------------+----------+
| 20181004 | 2526262363754747 | 34 |
| 20181004 | 2525626325173256 | 21 |
| 20181004 | 7436783255747736 | 34 |
| 20181004 | 6526241526363536 | 21 |
| 20181004 | 4252636353637423 | 6 |
| 20181004 | 3636325636673563 | 7 |
+----------+------------------+----------+
I get the feeling its because I'm grouping by sessions
above but that is because I get a validation error like so if I don't:
SELECT list expression references column sessions which is
neither grouped nor aggregated at
Upvotes: 1
Views: 11586
Reputation: 172944
Below is for BigQuery Standard SQL
Just add below around your original query
SELECT date,
ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
...
...
)
GROUP BY date
So, the whole stuff will look like below (and will produce desired result - while preserving your idea of using windowed functions)
#standardSQL
WITH initial_30days AS (
SELECT
date,
fullvisitorId AS user_id,
visitNumber,
CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
FROM `my-data.XXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103'
GROUP BY 1,2,3,4
)
SELECT date,
ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT
date,
ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
FROM(
SELECT
date,
user_id,
COUNT(DISTINCT( session_id)) AS sessions
FROM initial_30days
GROUP BY date,user_id
)
GROUP BY date,sessions
)
GROUP BY date
Upvotes: 1
Reputation: 1269463
If you want one row per date, you will need a GROUP BY date
:
SELECT date,
ARRAY_AGG(sessions) AS agg_array
FROM (SELECT date, user_id,
COUNT(DISTINCT( session_id)) AS sessions
FROM initial_30days
GROUP BY date, user_id
) du
GROUP BY date;
If you want only a certain number of values, then add LIMIT
to the ARRAY_AGG()
. For instance, if you wanted 5 sessions for the users with the smallest ids, you can do:
ARRAY_AGG(sessions ORDER BY user_id LIMIT 5) AS agg_array
Upvotes: 0