Reputation: 529
I have files in S3 with inline JSON per line of the structure:
{ "resources": [{"resourceType":"A","id":"A",...},{...}] }
If I run glue over it, I get "resource: array" as the top level element. I want however the elements of the array to be inspected and used as the top level table elements. All the elements per resources array will have the same schema. So I expect
resourceType: string
id: string
....
Theoretically, a custom JSON classifier should handle this:
$.resources[*]
However, the path is not picked up. So I still get the resources:array as the top level element.
I could now run some pre-processing to extract the array elements myself and write them line per line. However, I want to understand why my path is not working.
UPDATE 1:
It might be something with the JSON that I do not understand (its valid JSON produced via JAVA Jackson). If I remove the outer object with the resources attribute and change the structure to
[{"resourceType":"A","id":"A",...},{...}]
the classifier $[*]
should pick the sub-objects up. But I still get array:array as top level element.
UPDATE 2:
Its indeed a formatting issue. If I change the JSON files to
[
{"resourceType":"A","id":"A",...},{...}
]
$[*]
starts to work.
UPDATE 3:
Its however not fixing the issue with $.resources[*]
to reformat to
{
"resources": [
{"resourceType":"A","id":"A",...},{...}
]
}
UPDATE 4:
If I take my file and run it through a Intellij re-format, hence produce a JSON object where all nested elements have line breaks, it also starts working with $.resources[*]
. Basically, like in UPDATE 3 just applied down the structure.
{
"resources": [
{
"resourceType":"A",
"id":"A"
},
{
...
}
]
}
What bothers me is, that the requirements regarding the structure are still not clear to me, since UPDATE 2 worked, but not UPDATE 3. I also find nowhere in the documentation a formal requirement regarding the JSON structure.
In this sense, I think I got to the conclusion of my own question, but the systematics stay a bit unclear.
Upvotes: 3
Views: 3790
Reputation: 529
To conclude here:
The issue is related to unclear documented JSON formatting requirements of Glue.
A normalisation via json.dumps(my_json, separators=(',',':'))
produces compact JSON that works for my use case.
I normalised now the content via a lambda.
Lambda code as reference for whomever it may help:
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=my_bucket)
for page in pages:
try:
contents = page["Contents"]
except KeyError:
break
for obj in contents:
key = obj["Key"]
obj = s3.get_object(Bucket=my_bucket, Key=key)
j = json.loads(obj['Body'].read().decode('utf-8'))
new_json = json.dumps(j, separators=(',',':'))
target = 'nrmlzd/' + key
s3.put_object(
Body=new_json,
Bucket=my_bucket,
Key= target
)
Upvotes: 1