sn3fru
sn3fru

Reputation: 106

How to tranform SELECT sql query results in an array in Athena

This is a very general need and can help many people. It appears whenever we need to group information and summarize the grouped values as a string as follows:

person Visit Date Visit
First 2020-01-01 1
First 2020-01-02 0
First 2020-01-03 1
First 2020-01-04 1
First 2020-01-05 0
Second 2020-01-01 0
Second 2020-01-02 1
Second 2020-01-03 1
Second 2020-01-04 0
Second 2020-01-05 1

Grouping values by the person column:

person MIN(Visit Date) Sequence of Visits
First 2020-01-01 10110
Second 2020-01-01 01101

I am using Athena for this solution so I am not as free to write a procedural function.

Upvotes: 1

Views: 1721

Answers (2)

Guy
Guy

Reputation: 12929

In the general case, you can use the combination of reduce and array_agg as follows:

WITH example_table AS
 (SELECT 'First' as person, '2020-01-01' as _date, 1 as _value UNION ALL
  SELECT 'First', '2020-01-02', 0 UNION ALL
  SELECT 'First', '2020-01-03', 1 UNION ALL
  SELECT 'Second', '2020-01-01', 1 UNION ALL
  SELECT 'Second', '2020-01-02', 0)
SELECT person, min(_date), reduce(array_agg(_value ORDER BY _date), '', (s, x) -> CAST(s as varchar) || CAST(x as varchar), s -> s)
FROM example_table
GROUP BY person

The array_agg(_value ORDER BY _date) will create ARRAY [1,0,1], for example, for the first person. The reduce part is concatenating the varchar of the numbers in the above array.

For the specific output of varchar, you can also use array_join, which makes it even shorter:

WITH example_table AS
 (SELECT 'First' as person, '2020-01-01' as _date, 1 as _value UNION ALL
  SELECT 'First', '2020-01-02', 0 UNION ALL
  SELECT 'First', '2020-01-03', 1 UNION ALL
  SELECT 'Second', '2020-01-01', 1 UNION ALL
  SELECT 'Second', '2020-01-02', 0)
SELECT person, min(_date), array_join(array_agg(_value ORDER BY _date),'')
FROM example_table
GROUP BY person

BTW, try to avoid using generic names such as value or date in your column name. You will most likely hit some reserved words in one language or another, and it is hard to understand the meaning of your logic.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270431

In Athena, you would use array_agg() for this. But unfortunately, it comes with this very loud warning:

ORDER BY is not supported for aggregation functions, for example, you cannot use it within array_agg(x).

And I assume that ordering is important.

If you know that maximum number of values for each group, you can use conditional aggregation like this:

select person, min(date),
       ( max(case when seqnum = 1 then value else '' end) ||
         max(case when seqnum = 2 then value else '' end) ||
         max(case when seqnum = 3 then value else '' end) ||
         max(case when seqnum = 4 then value else '' end) ||
         max(case when seqnum = 5 then value else '' end)
       ) as values_5
from (select t.*,
             row_number() over (partition by person order by date) as seqnum
      from t
     ) t
group by person;

Upvotes: 1

Related Questions