Reputation: 4147
I have a JSON array of structures in S3, that is successfully Crawled & Cataloged by Glue.
[{"key":"value"}, {"key":"value"}]
I'm using the custom Classifier:
$[*]
When trying to query from Spectrum, however, it returns:
Top level Ion/JSON structure must be an anonymous array if and only if serde property 'strip.outer.array' is set. Mismatch occured in file...
I set that serde property manually in the Glue catalog table, but nothing changed.
Is it no possible to query an anonymous array via Spectrum?
Upvotes: 3
Views: 2886
Reputation: 1
The documentation on creating Spectrum tables explains this issue.
"This parameter supports the following SerDe property for JsonSerDe:
'strip.outer.array'='true'
Processes Ion/JSON files containing one very large array enclosed in outer brackets ( [ … ] ) as if it contains multiple JSON records within the array."
You can fix it in the console by navigating to the table that gets created by Glue, go to the Serde parameters section (under "Advanced properties"), and adding a new key-value pair with strip.outer.array as the key and true as the value.
Upvotes: 0
Reputation: 11
I've successfully done this, but without a data classifier. My JSON file looks like:
[
{
"col1": "data_from_col1",
"col2": "data_from_col2",
"col3": [
{
"col4": "data_from_col4",
...
{
]
},
{
"col1": "data_from_col1",
"col2": "data_from_col2",
"col3": [
{
"col4": "data_from_col4",
...
{
]
},
...
]
I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration options under Output CAN'T be set to Update the table definition..., or else re-running the crawler later will overwrite the manual changes described below. I used Add new columns only.
I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:
anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>
partition_0 string
I manually updated my schema to:
col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string
(And also add the serde param strip.outer.array.)
Then I had to rerun my crawler, and finally I could query in Spectrum like:
select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;
Upvotes: 1
Reputation: 9
If your table's structure is as follows:
CREATE EXTERNAL TABLE spectrum.testjson(struct<id:varchar(25),
columnName<array<struct<key:varchar(20),value:varchar(20)>>>);
you can use the following query to access the array element:
SELECT c.id, o.key, o.value FROM spectrum.testjson c, c.columnName o;
For more information you can refer the AWS Documentation:
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data-sqlextensions.html
Upvotes: 0
Reputation: 9
You can use json_extract_path_text for extracting the element or json_extract_array_element_text('json string', pos [, null_if_invalid ] ).
for example: for 2nd index element select json_extract_array_element_text('[111,112,113]', 2);
output: 113
Upvotes: 0
Reputation: 4147
Naming the array in the JSON file like this:
"values":[{"key":"value"},...}
And updating the classifier:
$.values[*]
Fixes the issue... Interested to know if there is a way to query anonymous arrays though. It seems pretty common to store data like that.
Update: In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now still no solution other than using individual records per line:
{"key":"value"}
{"key":"value"}
etc.
It does seem to be a Spectrum specific issue, as Athena would still work.
Interested to know if anyone else was able to get it to work...
Upvotes: 2