ScottP
ScottP

Reputation: 187

Using ARRAY_AGG() with DISTINCT and ORDER BY with ORDINAL

I have some data that I am trying to aggregate (greatly simplified here). The raw data uses a schema similar to the following:

UserID - STRING
A - RECORD REPEATED
A.Action - STRING
A.Visit - INTEGER
A.Order - INTEGER
MISC - RECORD REPEATED
( other columns omitted here )

There are many actual records due to the "MISC" column, but I'm only trying to focus on the first 5 columns shown above. A sample of the raw data is shown below (note that the values shown are a sample only, many other values exist so these cannot be hard coded into the query) :

Table 0: (Raw data sample)

(empty values under UserID are as shown in BiqQuery - "A" fields are part of a nested record)

Table0

My query produces the data shown in Table 1 below. I am trying to use ARRAY_AGG with ORDINAL to select only the first two "Action"s for each user and restructure as shown in TABLE 2.

SELECT
  UserId, ARRAY_AGG( STRUCT(A.Action, A.Visit, A.Order)
          ORDER BY A.Visit, A.Order, A.Action ) 
  FROM
    `table` 
  LEFT JOIN UNNEST(A) AS A
GROUP BY
    UserId

Table 1: (Sample output of above query )

Table1

Table 2: (The format needed)

Table2

So I need to:

  1. Get distinct "Action" values for each user
  2. Preserve the order ( UserID, Visit, Order )
  3. Show only the 1st and 2nd actions in one row

My attempted query strategy was to ORDER BY UserID, Visit, Order and get DISTINCT values of Action using something like:

UserId,
ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) FirstAction,
ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) SecondAction

However, that approach produces the following error:

Error: An aggregate function that has both DISTINCT and ORDER BY arguments can only ORDER BY columns that are arguments to the function

Any thoughts on how to correct this error (or an alternative approach?)

Upvotes: 12

Views: 37675

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59225

Not sure why the original query has DISTINCT, if the results shown in table 2 don't need de-duplication.

With that said:

#standardSQL
WITH sample AS (
  SELECT actor.login userid, type action
    , EXTRACT(HOUR FROM created_at) visit
    , EXTRACT(MINUTE FROM created_at) `order`
  FROM `githubarchive.day.20171005` 
)

SELECT userid, actions[OFFSET(0)] firstaction, actions[SAFE_OFFSET(1)] secondaction
FROM (
  SELECT userid, ARRAY_AGG(action ORDER BY visit, `order` LIMIT 2) actions
  FROM sample
  GROUP BY 1
  ORDER BY 1
  LIMIT 100
)

enter image description here

Upvotes: 10

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046


Try below.

#standardSQL
SELECT UserId, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(1)] AS FirstAction, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(2)] AS SecondAction 
FROM `project.dataset.table`
LEFT JOIN UNNEST(A) AS A 
GROUP BY UserId
-- ORDER BY UserId

You can test / play with it using dummy data from your question

#standardSQL
WITH `table` AS (
  SELECT 'U001' AS UserId, [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
    ('Register', 1, 1),('Upgrade', 1, 2),('Feedback', 1, 3),('Share', 1, 4),('Share', 2, 1)] AS A UNION ALL
  SELECT 'U002', [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
    ('Share', 7, 1),('Share', 7, 2),('Refer', 8, 1),('Feedback', 8, 2),('Feedback', 8, 3)] UNION ALL
  SELECT 'U003', [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
    ('Register', 1, 1),('Share', 1, 2),('Share', 1, 3),('Share', 2, 1),('Share', 2, 2),('Share', 3, 1),('Share', 3, 2)] 
)
SELECT UserId, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(1)] AS FirstAction, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(2)] AS SecondAction 
FROM `table`
LEFT JOIN UNNEST(A) AS A 
GROUP BY UserId
ORDER BY UserId

Upvotes: 3

Related Questions