Robert
Robert

Reputation: 529

AWS Glue classifier for extracting JSON array values

I have files in S3 with inline JSON per line of the structure:

{ "resources": [{"resourceType":"A","id":"A",...},{...}] }

If I run glue over it, I get "resource: array" as the top level element. I want however the elements of the array to be inspected and used as the top level table elements. All the elements per resources array will have the same schema. So I expect

resourceType: string
id: string
....

Theoretically, a custom JSON classifier should handle this:

$.resources[*]

However, the path is not picked up. So I still get the resources:array as the top level element.

I could now run some pre-processing to extract the array elements myself and write them line per line. However, I want to understand why my path is not working.

UPDATE 1:

It might be something with the JSON that I do not understand (its valid JSON produced via JAVA Jackson). If I remove the outer object with the resources attribute and change the structure to

[{"resourceType":"A","id":"A",...},{...}]

the classifier $[*] should pick the sub-objects up. But I still get array:array as top level element.

UPDATE 2:

Its indeed a formatting issue. If I change the JSON files to

[
  {"resourceType":"A","id":"A",...},{...}
]

$[*] starts to work.

UPDATE 3:

Its however not fixing the issue with $.resources[*] to reformat to

{
    "resources": [
        {"resourceType":"A","id":"A",...},{...}
    ]
}

UPDATE 4:

If I take my file and run it through a Intellij re-format, hence produce a JSON object where all nested elements have line breaks, it also starts working with $.resources[*]. Basically, like in UPDATE 3 just applied down the structure.

{
    "resources": [
        {
          "resourceType":"A",
          "id":"A"
        },
        {
          ...
        }
    ]
}

What bothers me is, that the requirements regarding the structure are still not clear to me, since UPDATE 2 worked, but not UPDATE 3. I also find nowhere in the documentation a formal requirement regarding the JSON structure.

In this sense, I think I got to the conclusion of my own question, but the systematics stay a bit unclear.

Upvotes: 3

Views: 3790

Answers (1)

Robert
Robert

Reputation: 529

To conclude here:

The issue is related to unclear documented JSON formatting requirements of Glue.

A normalisation via json.dumps(my_json, separators=(',',':')) produces compact JSON that works for my use case.

I normalised now the content via a lambda.

Lambda code as reference for whomever it may help:

    s3 = boto3.client('s3')

    paginator = s3.get_paginator('list_objects_v2')

    pages = paginator.paginate(Bucket=my_bucket)
    
    for page in pages:
        try:
            contents = page["Contents"]
        except KeyError:
            break

        for obj in contents:
            key = obj["Key"]
            obj = s3.get_object(Bucket=my_bucket, Key=key)
            j = json.loads(obj['Body'].read().decode('utf-8'))
            
            new_json = json.dumps(j, separators=(',',':'))
                
            target = 'nrmlzd/' + key
                
            s3.put_object(
                Body=new_json,
                Bucket=my_bucket,
                Key= target
            )

Upvotes: 1

Related Questions