GKyle
GKyle

Reputation: 679

Big Query Standard SQL using Partition By with the ARRAY_AGG() function

I am trying to use the PARTITION BY clause with the ARRAY_AGG() function to collapse a column into an array.

My Standard SQL in Big Query is as follows:

        WITH initial_30days
           AS (
          SELECT 
            date,
            fullvisitorId AS user_id,
            visitNumber, 
            CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
          FROM
            `my-data.XXXXXXX.ga_sessions_*`
            WHERE _TABLE_SUFFIX BETWEEN '20181004' AND  '20181103'
            GROUP BY 1,2,3,4
            )

          SELECT
            date,
            ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING 
            AND CURRENT ROW) AS agg_array
          FROM(

          SELECT
            date,
            user_id,
            COUNT(DISTINCT( session_id))  AS sessions
            FROM initial_30days
            GROUP BY date,user_id) 
            GROUP BY date,sessions

My expected output is:

+----------+--------------------------+
|   date   |        agg_array         |
+----------+--------------------------+
| 20181004 | [34,21,34,21,6,7,4,43]   |
| 20181005 | [1,5,56,76,23,1,3,54,45] |
| 20181006 | [22,67,43,1,2,67,3,24]   |
| 20181007 | [34,21,34,21,6,7,4,43]   |
+----------+--------------------------+

My current output looks something like this taking one date value for example:

+----------+------------------------+
|   date   |       agg_array        |
+----------+------------------------+
| 20181004 | [34]                   |
| 20181004 | [34,21]                |
| 20181004 | [34,21,34]             |
| 20181004 | [34,21,34,21]          |
| 20181004 | [34,21,34,21,6]        |
| 20181004 | [34,21,34,21,6,7]      |
| 20181004 | [34,21,34,21,6,7,4]    |
| 20181004 | [34,21,34,21,6,7,4,43] |
+----------+------------------------+

You can see the array partitioned by date creates an incremental row for each value of that array.

The dataset that the ARRAY_AGG() function is applied over looks like:

+----------+------------------+----------+
|   date   |     user_id      | sessions |
+----------+------------------+----------+
| 20181004 | 2526262363754747 |       34 |
| 20181004 | 2525626325173256 |       21 |
| 20181004 | 7436783255747736 |       34 |
| 20181004 | 6526241526363536 |       21 |
| 20181004 | 4252636353637423 |        6 |
| 20181004 | 3636325636673563 |        7 |
+----------+------------------+----------+

I get the feeling its because I'm grouping by sessions above but that is because I get a validation error like so if I don't:

    SELECT list expression references column sessions which is 
neither grouped nor aggregated at 

Upvotes: 1

Views: 11586

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Below is for BigQuery Standard SQL

Just add below around your original query

SELECT date, 
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
  ...   
  ...   
)
GROUP BY date   

So, the whole stuff will look like below (and will produce desired result - while preserving your idea of using windowed functions)

#standardSQL
WITH initial_30days AS (
  SELECT 
    date,
    fullvisitorId AS user_id,
    visitNumber, 
    CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
  FROM `my-data.XXXXXXX.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20181004' AND  '20181103'
  GROUP BY 1,2,3,4
)
SELECT date, 
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT
    date, 
    ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
  FROM(
    SELECT
      date,
      user_id,
      COUNT(DISTINCT( session_id))  AS sessions
    FROM initial_30days
    GROUP BY date,user_id
  )
  GROUP BY date,sessions
)
GROUP BY date   

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269463

If you want one row per date, you will need a GROUP BY date:

SELECT date,
       ARRAY_AGG(sessions) AS agg_array
FROM (SELECT date, user_id,
             COUNT(DISTINCT( session_id))  AS sessions
      FROM initial_30days
      GROUP BY date, user_id
     )  du
GROUP BY date;

If you want only a certain number of values, then add LIMIT to the ARRAY_AGG(). For instance, if you wanted 5 sessions for the users with the smallest ids, you can do:

  ARRAY_AGG(sessions ORDER BY user_id LIMIT 5) AS agg_array

Upvotes: 0

Related Questions