Reputation: 335
I'm facing the problem of having a JSON file where the same key sometimes has a flat value, while at other times it has an additional nested (and, for my purposes, unnecessary) level that contains the related value.
The file is newline-delimited and I am trying to get rid of any such additional levels. So far I've managed to do that only when the nested level appears in the first branch of the tree, using
jq -c '[.] | map(.[] |= if type == "object" and (.numberLong | length) > 0 then .numberLong else . end) | .[]' mongoDB.json
The example below illustrates that further. What I have initially:
{
  "name": "John",
  "age": {
    "numberLong": 22
  }
}
{
  "name": "Jane",
  "age": 24
}
{
  "name": "Dennis",
  "age": 34,
  "details": [
    {
      "telephone_number": 555124124
    }
  ]
}
{
  "name": "Frances",
  "details": [
    {
      "telephone_number": {
        "numberLong": 444245523
      }
    }
  ]
}
What my script does (the second numberLong is ignored):
{
  "name": "John",
  "age": 22
}
{
  "name": "Jane",
  "age": 24
}
{
  "name": "Dennis",
  "age": 34,
  "details": [
    {
      "telephone_number": 555124124
    }
  ]
}
{
  "name": "Frances",
  "details": [
    {
      "telephone_number": {
        "numberLong": 444245523
      }
    }
  ]
}
What I am actually hoping to achieve (recursively copy the values of all numberLong keys one level up, regardless of where they appear in the file):
[
  {
    "name": "John",
    "age": 22
  },
  {
    "name": "Jane",
    "age": 24
  },
  {
    "name": "Dennis",
    "age": 34,
    "details": [
      {
        "telephone_number": 555124124
      }
    ]
  },
  {
    "name": "Frances",
    "details": [
      {
        "telephone_number": 444245523
      }
    ]
  }
]
This transformation is part of a daily pipeline and is applied to several files with sizes of up to 70 GB, so speed while traversing the files could be an issue. The problem stems from MongoDB's distinction between numeric types: see MongoDB differences between NumberLong and simple Integer?
Thanks!
Upvotes: 3
Views: 775
Reputation: 116780
If your jq has 'walk/1' then the simplest completely generic solution would be along these lines:
walk( if type == "object"
      then with_entries( if .value | (type == "object" and has("numberLong"))
                         then .value |= .numberLong
                         else . end )
      else . end )
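Applied to the newline-delimited file from the question, the invocation might look like this (the file name mongoDB.json and the -c flag are carried over from the original command):

jq -c 'walk( if type == "object"
             then with_entries( if .value | (type == "object" and has("numberLong"))
                                then .value |= .numberLong
                                else . end )
             else . end )' mongoDB.json

This keeps the output newline-delimited; adding the -s (slurp) option would instead produce a single array like the one shown under "What I am actually hoping to achieve", but slurping is best avoided for inputs of this size.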
If your jq does not have 'walk', then it would be best to upgrade, as that will also improve speed; otherwise you can google for its def in jq.
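For reference, a definition along the lines of the builtin is shown below (with older versions of jq, keys can be substituted for keys_unsorted):

def walk(f):
  . as $in
  | if type == "object" then
      reduce keys_unsorted[] as $key
        ( {}; . + { ($key): ($in[$key] | walk(f)) } ) | f
    elif type == "array" then map( walk(f) ) | f
    else f
    end;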
If this is too slow for your very large files, you may have to track down the precise locations where the transformation is needed to avoid the overhead of a completely generic approach.
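For example, if numberLong only ever appears under age and under details[].telephone_number (an assumption based purely on the sample data), a targeted sketch such as the following avoids walking every value:

( .age? | objects ) |= .numberLong
| ( .details[]?.telephone_number | objects ) |= .numberLong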
Your example ("What I have initially") gives a stream of objects, so it might be worth pointing out that since jq is stream-oriented, it has no problem handling very large files consisting of streams of JSON entities (aka "documents") that are not so large individually.
(An approximate rule of thumb is that if the largest JSON entity in the input has size N units, and if the largest JSON entity created by jq has size M units, then jq might need access to about M + N + max(M,N) units of memory.)
To handle a very large file containing a single JSON array, it might be advisable to begin by producing a stream of the top-level elements for subsequent processing.
In the worst-possible case (a very large file with one very large, complex JSON document) you might have to use a streaming parser such as the one that jq has.
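For instance, jq's streaming parser (the --stream option, available since jq 1.5) can turn a single top-level array into a stream of its elements, which can then be piped into the filter above; a minimal sketch, assuming the large array lives in a file named huge.json:

jq -cn --stream 'fromstream( 1 | truncate_stream(inputs) )' huge.json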
For illustrations of various techniques for handling very large files, see Process huge GEOJson file with jq
Upvotes: 3