SimonB
SimonB

Reputation: 1055

Athena/Presto : complex structure/array

Think that I am asking the impossible here, but throwing it out there. Trying to query some json in Athena. The data I'm working with looks like this (excerpt)

condition={
        "foranyvalue:stringlike":{"s3:prefix":["lala","hehe"]},
        "forallvalues:stringlike":{"s3:prefix":["apples","bananas"]
}

.. and I need to get to here : enter image description here

... PLUS:the key names are not fixed, so one day I might get:

condition={"something not seen before":{"surprise":["haha","hoho"]}}

With that last point, I was hoping to treat this an an array, and start by splitting the 'foranyvalue' and 'forallvalues' parts into separate rows. but with everything wrapped in {}, it refuses to unnest.

But despite the above failed plan - ANY tips on solving this by ANY means gratefully received !

Thank You

Upvotes: 1

Views: 3209

Answers (1)

Theo
Theo

Reputation: 132862

When you have JSON data that does not have a schema that is easy to describe you can use STRING as the type of the column and then use Athena/Presto's JSON functions to query them, in combination with casting to MAP and UNNEST to flatten the structures.

One way of achieving what I think you're trying to do would be something like this:

WITH the_table AS (
  SELECT CAST(condition AS MAP(VARCHAR, JSON)) AS condition
  FROM (
    VALUES
    (JSON '{"foranyvalue:stringlike":{"s3:prefix":["lala","hehe"]},"forallvalues:stringlike":{"s3:prefix":["apples","bananas"]}}'),
    (JSON '{"something not seen before":{"surprise":["haha","hoho"]}}')
  ) AS t (condition)
),
first_flattening AS (
  SELECT 
    SPLIT(first_level_key, ':', 2) AS first_level_key,
    CAST(first_level_value AS MAP(VARCHAR, JSON)) AS first_level_value
  FROM the_table
  CROSS JOIN UNNEST (condition) AS t (first_level_key, first_level_value)
),
second_flattening AS (
  SELECT
    first_level_key,
    second_level_key,
    second_level_value
  FROM first_flattening
  CROSS JOIN UNNEST (first_level_value) AS t (second_level_key, second_level_value)
)
SELECT
 first_level_key[1] AS "for",
 TRY(first_level_key[2]) AS condition,
 second_level_key AS "left",
 second_level_value AS "right"
FROM second_flattening

I've included the two examples you gave as an inline VALUES list in the first CTE, and exactly what to do in the table declaration (i.e. what type for the column to use) and what processing to do in the query (i.e. the cast) depends on your data and how you want/can set up the table. YMMV.

The query flattens the JSON structure in a couple of separate steps, first flattening the first level of keys and values, then the keys and values of the inner documents. It might be possible to do this in one step, but doing it in two at least makes it easier to read.

Since the first level keys don't always have the colon I've used TRY to make sure that accessing the second value doesn't break anything. You could perhaps filter out values without a colon earlier and avoid this, since you're not interested in them.

Upvotes: 2

Related Questions