Reputation: 2026
I am new to athena and am currently working on reading files from s3 and presenting them as table.
Below is how individual files look like in s3 buckets.
sample.json
[
{
"name":"John",
"age":29,
"salary":300
},
{
"name":"Mary",
"age":25,
"salary":500
}
]
I tried creating a table with this data but I am not able to get it in the desire format. I would like to get each dictionary to take up one row. The table should have three columns (Name, Age, Salary).
Tried below but it's putting entire content under data column as it is.
CREATE EXTERNAL TABLE IF NOT EXISTS test(
`data` string
)
LOCATION 's3://somelocation/'
TBLPROPERTIES ('has_encrypted_data'='false');
output
data
-----
[{"name":"John",...},{{"name":"Mary",...}]
Referred this but didn't reach anywhere.
Is there an ideal way to do this ? FYI, the number of dict present in each file may vary and is not at all fixed. So I am looking for a dynamic solution which works good if there is just one dict and if there are 10 in each file in s3.
Upvotes: 1
Views: 6172
Reputation: 2006
If you create an Athena table based on the Json SerDe and you want a single s3 object to contain multiple rows/records inside of it, the expectation is that each row/record is on its own line in the file, and there there is no outer JSON array wrapping all of the records.
If you want the table to have three columns for name, age and salary you'll need to declare those columns in your table DDL. The Json SerDe will find the JSON props accordingly, if you restructure your files like I described above.
If you can't restructure your files easily to have a record per line, then you may want to look into using the Hive ARRAY type for the column you're currently calling data
, and then use Presto's UNNEST operator to flatten those array entries out into a table. You'd probably want to declare data as an array<struct<>>
where the struct has the name, age & salary props inside it. But I think you'll have an easier time if you can avoid the extra layer of nesting.
Upvotes: 4
Reputation: 12901
You have two options:
$[*]
SELECT json_extract('[
{
"name":"John",
"age":29,
"salary":300
},
{
"name":"Mary",
"age":25,
"salary":500
}
]','$[0].name');
or to get the last record:
SELECT json_array_get('[
{
"name":"John",
"age":29,
"salary":300
},
{
"name":"Mary",
"age":25,
"salary":500
}
]',-1);
Upvotes: 2