Reputation: 642
I have the following flow in NiFi; the JSON has 1000+ objects in it.
InvokeHTTP -> SplitJson -> PutMongo
The flow works fine until I receive keys in the JSON with "." in the name, e.g. "spark.databricks.acl.dfAclsEnabled".
My current solution is not optimal: I have jotted down the bad keys and use multiple ReplaceText processors to replace "." with "_". I am not using regex; I am using literal string find/replace. So each time I get a failure in the PutMongo processor, I insert a new ReplaceText processor.
This is not maintainable. I am wondering if I can use JOLT for this. A couple of notes regarding the input JSON:
1) There is no set structure; the only thing that is confirmed is that everything will be in the events array. The event objects themselves are free-form.
2) Maximum list size = 1000.
3) It is third-party JSON, so I can't ask for a change in format.
Also, keys with "." can appear anywhere, so I am looking for a JOLT spec that can cleanse the keys at every level and rename them.
{
  "events": [
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1531896847915,
      "type": "EDITED",
      "details": {
        "previous_attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "spark_conf": {
            "spark.databricks.acl.dfAclsEnabled": "true",
            "spark.databricks.repl.allowedLanguages": "python,sql"
          },
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "previous_cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "user": ""
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535540053785,
      "type": "TERMINATING",
      "details": {
        "reason": {
          "code": "INACTIVITY",
          "parameters": {
            "inactivity_duration_min": "15"
          }
        }
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535537117300,
      "type": "EXPANDED_DISK",
      "details": {
        "previous_disk_size": 29454626816,
        "disk_size": 136828809216,
        "free_space": 17151311872,
        "instance_id": "6cea5c332af94d7f85aff23e5d8cea37"
      }
    }
  ]
}
Upvotes: 2
Views: 1392
Reputation: 14184
I created a template using ReplaceText and RouteOnContent to perform this task. The loop is required because the regex only replaces the first "." in the JSON key on each pass. You might be able to refine this to perform all substitutions in a single pass, but after fuzzing the regex with look-ahead and look-behind groups for a few minutes, re-routing was faster. I verified this works with the JSON you provided, and also with JSON where the keys and values are on different lines (with the ":" on either line):
...
"spark_conf": {
"spark.databricks.acl.dfAclsEnabled":
"true",
"spark.databricks.repl.allowedLanguages"
: "python,sql"
},
...
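The template carries the exact configuration; purely as an illustration of the pattern (not the template's exact expressions), the loop can be built from properties along these lines:

ReplaceText
    Replacement Strategy: Regex Replace
    Evaluation Mode: Entire text
    Search Value: "([^"]*)\.([^"]*)"(\s*:)
    Replacement Value: "$1_$2"$3

RouteOnContent
    Match Requirement: content must contain match
    dotted_key (dynamic property): "[^"]*\.[^"]*"\s*:

Route the dotted_key relationship back into ReplaceText and the unmatched relationship on to PutMongo; each pass removes one "." per key until RouteOnContent stops matching.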
You could also use an ExecuteScript processor with Groovy to ingest the JSON, quickly filter all JSON keys that contain ".", perform a collect operation to do the replacement, and re-insert the keys in the JSON data, if you want a single processor to do this in a single pass.
Upvotes: 1