Naxi
Naxi

Reputation: 2026

How to work with json arrays in AWS athena

I am new to athena and am currently working on reading files from s3 and presenting them as table.

Below is how individual files look like in s3 buckets.

sample.json

[
    {
    "name":"John",
    "age":29,
    "salary":300
    },
    {
    "name":"Mary",
    "age":25,
    "salary":500
    }
]

I tried creating a table with this data but I am not able to get it in the desire format. I would like to get each dictionary to take up one row. The table should have three columns (Name, Age, Salary).

Tried below but it's putting entire content under data column as it is.

CREATE EXTERNAL TABLE IF NOT EXISTS test(
                   `data` string
                 )
                 LOCATION 's3://somelocation/'
                 TBLPROPERTIES ('has_encrypted_data'='false');

output

data
-----
[{"name":"John",...},{{"name":"Mary",...}]

Referred this but didn't reach anywhere.

Is there an ideal way to do this ? FYI, the number of dict present in each file may vary and is not at all fixed. So I am looking for a dynamic solution which works good if there is just one dict and if there are 10 in each file in s3.

Upvotes: 1

Views: 6172

Answers (2)

kylejmcintyre
kylejmcintyre

Reputation: 2006

If you create an Athena table based on the Json SerDe and you want a single s3 object to contain multiple rows/records inside of it, the expectation is that each row/record is on its own line in the file, and there there is no outer JSON array wrapping all of the records.

If you want the table to have three columns for name, age and salary you'll need to declare those columns in your table DDL. The Json SerDe will find the JSON props accordingly, if you restructure your files like I described above.

If you can't restructure your files easily to have a record per line, then you may want to look into using the Hive ARRAY type for the column you're currently calling data, and then use Presto's UNNEST operator to flatten those array entries out into a table. You'd probably want to declare data as an array<struct<>> where the struct has the name, age & salary props inside it. But I think you'll have an easier time if you can avoid the extra layer of nesting.

Upvotes: 4

Guy
Guy

Reputation: 12901

You have two options:

  • Use AWS Glue with custom classifier using the JSON path as $[*]
  • Keep the record as long varchar (string) and parse it on the fly with queries using JSON functions. For example to get the name of the first record:
SELECT json_extract('[
    {
      "name":"John",
      "age":29,
      "salary":300
    },
    {
      "name":"Mary",
      "age":25,
      "salary":500
    }
]','$[0].name');

or to get the last record:

SELECT json_array_get('[
    {
      "name":"John",
      "age":29,
      "salary":300
    },
    {
      "name":"Mary",
      "age":25,
      "salary":500
    }
]',-1);

Upvotes: 2

Related Questions