TheDataGuy
TheDataGuy

Reputation: 3108

Does AWS Athena supports Order by in Array_AGG?

Im working with AWS Athena to concat a few rows to a single row.

Example table:(name: unload)

xid pid sequence    text
1   1   0           select * from
1   1   1           mytbl
1   1   2   
2   1   0           update test
2   1   1            set mycol=
2   1   2           'a';

So want to contact the text column.

Expected Output:

xid pid text
1   1   select * from mytbl
2   1   update test set mycol='a';

I ran the following query to partition it first with proper order and do the concat.

with cte as
    (SELECT  
         xid,
        pid,

         sequence,
         text,
        row_number()
        OVER (PARTITION BY  xid,pid
    ORDER BY  sequence) AS rank
    FROM unload 
    GROUP BY   xid,pid,sequence,text
    )
SELECT 
         xid,
        pid,
        array_join(array_agg(text),'') as text
FROM cte
GROUP BY   xid,pid

But if you see the below output the order got misplaced.

xid pid text
1   1   mytblselect * from
2   1   update test'a'; set mycol=

I checked the Presto documentation, the latest version supports order by in array agg, but Athena is using Presto 0.172, so Im not sure whether it is supported or not.

What is the workaround for this in Athena?

Upvotes: 4

Views: 3088

Answers (1)

OurNewestMember
OurNewestMember

Reputation: 41

One approach:

  1. create records with a sortable format of text
  2. aggregate into an unsorted array
  3. sort the array
  4. transform each element back into the original value of text
  5. convert the sorted array to a string output column
WITH cte AS (
SELECT
xid, pid, text
-- create a sortable 19-digit ranking string
, SUBSTR(
  LPAD(
    CAST(
      ROW_NUMBER() OVER (PARTITION BY xid, pid ORDER BY sequence)
      AS VARCHAR)
    , 19
    , '0')
  , -19) AS SEQ_STR
FROM unload
)
SELECT
xid, pid
-- make sortable string, aggregate into array
-- then sort array, revert each element to original text
-- finally combine array elements into one string
, ARRAY_JOIN(
  TRANSFORM(
    ARRAY_SORT(
      ARRAY_AGG(SEQ_STR || text))
    , combined -> SUBSTR(combined, 1 + 19))
  , ' '
  , '') AS TEXT
FROM cte
GROUP BY xid, pid
ORDER BY xid, pid

This code assumes:

  1. xid + pid + sequence is unique for all input records
  2. There are not many combinations of xid + pid + sequence (eg, not more than 20 million)

Upvotes: 2

Related Questions