CYC
CYC

Reputation: 325

jq parse json with stream flag into different json file

I have a json file as below called data.json, I want to parse the data with jq tool in streaming mode(do not load the whole file into memory), because the real data have 20GB

the streaming mode in jq seems to add a flag --stream and it will parse the json file row by row

{
  "id": {
    "bioguide": "E000295",
    "thomas": "02283",
    "govtrack": 412667,
    "opensecrets": "N00035483",
    "lis": "S376"
  },
  "bio": {
    "gender": "F",
    "birthday": "1970-07-01"
  },
  "tooldatareports": [
    {
      "name": "A",
      "tooldata": [
        {
          "toolid": 12345,
          "data": [
            {
              "time": "2021-01-01",
              "value": 1
            },
            {
              "time": "2021-01-02",
              "value": 10
            },
            {
              "time": "2021-01-03",
              "value": 5
            }
          ]
        },
        {
          "toolid": 12346,
          "data": [
            {
              "time": "2021-01-01",
              "value": 10
            },
            {
              "time": "2021-01-02",
              "value": 100
            },
            {
              "time": "2021-01-03",
              "value": 50
            }
          ]
        }
      ]
    }
  ]
}

The final result I hope it can become as below

A list contains two dict, each dict contain 2 keys

[
  {
    "data": [
      {
        "time": "2021-01-01",
        "value": 1
      },
      {
        "time": "2021-01-02",
        "value": 10
      },
      {
        "time": "2021-01-03",
        "value": 5
      }
    ]
  },
  {
    "data": [
      {
        "time": "2021-01-01",
        "value": 10
      },
      {
        "time": "2021-01-02",
        "value": 100
      },
      {
        "time": "2021-01-03",
        "value": 50
      }
    ]
  }
]

For this problem, I use the below command line to get a result, but it still has some differences.

cat data.json | jq --stream 'select(.[0][0]=="tooldatareports" and .[0][2]=="tooldata" and .[1]!=null) | .'

  1. the result is not a list contain a lot of dict
  2. for each time and value are separate in the different list

Does anyone have any idea about this?

Upvotes: 1

Views: 991

Answers (3)

Logan Lee
Logan Lee

Reputation: 997

Here's a step-by-step explanation of peak's answers.

First let's convert the json to stream.

https://jqplay.org/s/VEunTmDSkf

[["id","bioguide"],"E000295"]
[["id","thomas"],"02283"]
[["id","govtrack"],412667]
[["id","opensecrets"],"N00035483"]
[["id","lis"],"S376"]
[["id","lis"]]
[["bio","gender"],"F"]
[["bio","birthday"],"1970-07-01"]
[["bio","birthday"]]
[["tooldatareports",0,"name"],"A"]
[["tooldatareports",0,"tooldata",0,"toolid"],12345]
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"toolid"],12346]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
[["tooldatareports",0,"tooldata",1]]
[["tooldatareports",0,"tooldata"]]
[["tooldatareports",0]]
[["tooldatareports"]]

Now do .[0] to extract the path portion of stream.

https://jqplay.org/s/XdPrp8RuEj

["id","bioguide"]
["id","thomas"]
["id","govtrack"]
["id","opensecrets"]
["id","lis"]
["id","lis"]
["bio","gender"]
["bio","birthday"]
["bio","birthday"]
["tooldatareports",0,"name"]
["tooldatareports",0,"tooldata",0,"toolid"]
["tooldatareports",0,"tooldata",0,"data",0,"time"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"time"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"time"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2]
["tooldatareports",0,"tooldata",0,"data"]
["tooldatareports",0,"tooldata",1,"toolid"]
["tooldatareports",0,"tooldata",1,"data",0,"time"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"time"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"time"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2]
["tooldatareports",0,"tooldata",1,"data"]
["tooldatareports",0,"tooldata",1]
["tooldatareports",0,"tooldata"]
["tooldatareports",0]
["tooldatareports"]

Let me first quickly explain index\1.

index("data") of [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"] is 4 since that is the index of the first occurrence of "data".

Knowing that let's now do .[0] | index("data").

https://jqplay.org/s/ny0bV1xEED

null
null
null
null
null
null
null
null
null
null
null
4
4
4
4
4
4
4
4
4
4
4
null
4
4
4
4
4
4
4
4
4
4
4
null
null
null
null

As you can see in our case the indexes are either 4 or null. We want to filter each input such that the corresponding index is not null. Those are the input that have "data" as part of their path.

(.[0] | index("data")) as $ix | select($ix) does just that. Remember that each $ix is mapped to each input. So only input with their $ix being not null are displayed.

For example see https://jqplay.org/s/NwcD7_USZE Here inputs | select(null) gives no output but inputs | select(true) outputs every input.

These are the filtered stream:

https://jqplay.org/s/SgexvhtaGe

[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]

Before we go further let's review update assignment.

Have a look at https://jqplay.org/s/g4P6j8f9FG

Let's say we have input [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"].

Then filter .[0] |= .[4:] produces [["data",0,"time"],"2021-01-01"].

Why?

Remember that right hand side (.[4:]) inherits the context of the left hand side(.[0]). So in this case it has the effect of updating the path ["tooldatareports",0,"tooldata",0,"data",0,"time"] to ["data",0,"time"].

Let's move on then.

So (.[0] | index("data")) as $ix | select($ix) | .[0] |= .[$ix:] has the output:

https://jqplay.org/s/AwcQpVyHO2

[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],1]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],10]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],5]
[["data",2,"value"]]
[["data",2]]
[["data"]]
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],10]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],100]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],50]
[["data",2,"value"]]
[["data",2]]
[["data"]]

Now all we need to do is convert this stream back to json.

https://jqplay.org/s/j2uyzEU_Rc

[fromstream(inputs)] gives:

[
  {
    "data": [
      {
        "time": "2021-01-01",
        "value": 1
      },
      {
        "time": "2021-01-02",
        "value": 10
      },
      {
        "time": "2021-01-03",
        "value": 5
      }
    ]
  },
  {
    "data": [
      {
        "time": "2021-01-01",
        "value": 10
      },
      {
        "time": "2021-01-02",
        "value": 100
      },
      {
        "time": "2021-01-03",
        "value": 50
      }
    ]
  }
]

This is the output we wanted.

Upvotes: 1

peak
peak

Reputation: 116900

Here's a solution that does not use truncate_stream:

jq -n --stream '
 [fromstream( 
   inputs
   | (.[0] | index("data")) as $ix
   | select($ix)
   | .[0] |= .[$ix:] )]
' input.json

Upvotes: 1

peak
peak

Reputation: 116900

The following produces the required output:

jq -n --stream '
  [{data: fromstream(5|truncate_stream(inputs))}]
' input.json 

Needless to say, there are other variations ...

Upvotes: 0

Related Questions