James Brown

Reputation: 37404

Converting JSON with arrays to CSV using jq

I found myself in the world of JSON and I'm trying to convert out of it using jq. I'm trying to convert the following structure to CSV:

{
  "Action": "A1",
  "Group": [
    {
      "Id": "10",
      "Units": [
        "1"
      ]
    }
  ]
}
{
  "Action": "A2",
  "Group": [
    {
      "Id": "11",
      "Units": [
        "2"
      ]
    },
    {
      "Id": "20",
      "Units": []
    }
  ]
}
{
  "Action": "A1",
  "Group": [
    {
      "Id": "26",
      "Units": [
        "1",
        "3"
      ]
    }
  ]
}
{
  "Action": "A3",
  "Group": null
}

where the Ids are between 10 and 99 and the Units between 1 and 5. Expected output would be (quoted or unquoted, comma-separated or not; I used pipe separators for clarity):

Action|Group|Unit1|Unit2|Unit3|Unit4|Unit5
A1|10|1|0|0|0|0
A2|11|0|1|0|0|0
A2|20|0|0|0|0|0
A1|26|1|0|1|0|0
A3|0|0|0|0|0|0

I've played around with this for a while now (history | grep jq | wc -l says 107) but haven't made any real progress toward combining the keys with each other; I'm basically just getting lists of keys (jq n00b).

Update:

Testing the solution (sorry, been a bit slow) I noticed that the data also has records with "Group": null, i.e.:

{
  "Action": "A3",
  "Group": null
}

(the above few lines were added to the main test data set), which results in jq: error (at file.json:61): Cannot iterate over null (null). Expected output would be:

A3|0|0|0|0|0|0

Is there an easy way out of that one?
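For reference, the whole desired transformation, including the null-Group case, can be sketched in Python (a sketch only; `to_rows` is a hypothetical name, not part of any jq solution below):

```python
RECORDS = [
    {"Action": "A1", "Group": [{"Id": "10", "Units": ["1"]}]},
    {"Action": "A2", "Group": [{"Id": "11", "Units": ["2"]},
                               {"Id": "20", "Units": []}]},
    {"Action": "A1", "Group": [{"Id": "26", "Units": ["1", "3"]}]},
    {"Action": "A3", "Group": None},
]

def to_rows(records, n_units=5):
    """Flatten records into pipe-separated rows with fixed Unit columns."""
    header = ["Action", "Group"] + [f"Unit{i}" for i in range(1, n_units + 1)]
    out = ["|".join(header)]
    for rec in records:
        # A null Group is treated as a single placeholder group with Id "0".
        for g in rec["Group"] or [{"Id": "0", "Units": []}]:
            flags = ["0"] * n_units
            for u in g["Units"]:
                flags[int(u) - 1] = "1"   # unit "1" -> column Unit1
            out.append("|".join([rec["Action"], str(g["Id"])] + flags))
    return out

print("\n".join(to_rows(RECORDS)))
```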

Upvotes: 2

Views: 1642

Answers (4)

peak

Reputation: 116730

Here is a solution with minimal memory requirements for the case where the number of "Unit" columns (n) is unknown beforehand. In the first pass, n is computed.

stream.jq

This is for the second pass:

# Output: a stream of arrays.
def synopsis:
  inputs
  | .Action as $a
  | .Group |= (. // [{Id:0, Units:[]}])
  | .Group[] 
  | [$a, .Id, (.Units|map(tonumber-1))];

def h(n): ["Action", "Group", "Unit\(range(1;n+1))"];

# Output: an array suitable for @csv
def stream(n):
  def i: reduce .[] as $i ([range(0;n)|0]; .[$i]=1);
  .[0:2] + (.[2] | i) ;

h($width), (synopsis | stream($width)) | @csv

Invocation

jq -rn --argjson width $(jq -n '
  [inputs|(.Group//[{Units:[]}])[]|.Units|map(tonumber)|max]|max
  ' data.json) -f stream.jq data.json
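The first pass only needs the largest unit number, which becomes $width. The same scan can be sketched in Python (a sketch, assuming already-parsed records; `max_unit` is a hypothetical name):

```python
def max_unit(records):
    """First pass: find the largest unit number across all records,
    treating a null Group as an empty group list (the jq // default)."""
    m = 0
    for rec in records:
        for g in rec["Group"] or [{"Units": []}]:
            for u in g["Units"]:
                m = max(m, int(u))
    return m

sample = [
    {"Action": "A1", "Group": [{"Id": "10", "Units": ["1"]}]},
    {"Action": "A1", "Group": [{"Id": "26", "Units": ["1", "3"]}]},
    {"Action": "A3", "Group": None},
]
print(max_unit(sample))  # the value passed as --argjson width
```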

Output

This is the output with the "null" record ({"Action": "A3","Group": null}) appended:

"Action","Group","Unit1","Unit2","Unit3"
"A1","10",1,0,0
"A2","11",0,1,0
"A2","20",0,0,0
"A1","26",1,0,1
"A3",0,0,0,0

Upvotes: 1

jq170727

Reputation: 14655

Here is a general solution if the set of unit columns isn't known in advance:

def normalize: [            # convert input to array of flattened objects e.g. 
      inputs                # [{"Action":"A1","Group":"10","Unit1":"1"}, ...]
    | .Action as $a
    | .Group[]
    |   {Action:$a, Group:.Id}
      + reduce .Units[] as $u ({};.["Unit\($u)"]="1")
  ];

def columns:                # compute column names
  [ .[] | keys[] ] | unique ;

def rows($names):           # generate row arrays
    .[] | [ .[$names[]] ] | map( .//"0" );

normalize | columns as $names | $names, rows($names) | join("|")

Sample Run (assumes filter in filter.jq and data in data.json)

$ jq -Mnr -f filter.jq data.json
Action|Group|Unit1|Unit2|Unit3
A1|10|1|0|0
A2|11|0|1|0
A2|20|0|0|0
A1|26|1|0|1


In this specific problem the ordering produced by unique matches the column order we want. If that were not the case, columns would need to be more complicated.
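For comparison, the same normalize / columns / rows pipeline written out in Python (a sketch; the dict-of-flags representation mirrors the flattened objects the jq filter builds, and null Groups are not handled here either):

```python
def normalize(records):
    """Flatten each record into one dict per group, with Unit<u> flags."""
    flat = []
    for rec in records:
        for g in rec["Group"]:
            row = {"Action": rec["Action"], "Group": g["Id"]}
            for u in g["Units"]:
                row[f"Unit{u}"] = "1"
            flat.append(row)
    return flat

def columns(flat):
    """Union of all keys, sorted -- the analogue of keys and unique in jq."""
    return sorted({k for row in flat for k in row})

def rows(flat, names):
    """Project each flattened dict onto the column list, defaulting to "0"."""
    return [[row.get(n, "0") for n in names] for row in flat]

data = [
    {"Action": "A1", "Group": [{"Id": "10", "Units": ["1"]}]},
    {"Action": "A2", "Group": [{"Id": "11", "Units": ["2"]},
                               {"Id": "20", "Units": []}]},
    {"Action": "A1", "Group": [{"Id": "26", "Units": ["1", "3"]}]},
]
names = columns(normalize(data))
for r in [names] + rows(normalize(data), names):
    print("|".join(r))
```

As in the jq version, the sorted key union happens to give the column order we want because "Action" < "Group" < "Unit1" lexicographically.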

Much of the complexity comes from dealing with not knowing the final set of Unit columns. If the unit set is fixed and reasonably small (e.g. 1-5) a simpler filter can be used:

  ["\(1+range(5))"] as $units
| ["Action", "Group", "Unit\($units[])"]
, ( inputs 
  | .Action as $a 
  | .Group[] 
  | [$a, .Id, (.Units[$units[]|[.]] | if .!=[] then "1" else "0" end) ]
) | join("|")

Sample Run

$ jq -Mnr '["\(1+range(5))"] as $units | ["Action", "Group", "Unit\($units[])"], (inputs | .Action as $a | .Group[] | [$a, .Id, (.Units[$units[]|[.]] | if .!=[] then "1" else "0" end) ] ) | join("|")' data.json
Action|Group|Unit1|Unit2|Unit3|Unit4|Unit5
A1|10|1|0|0|0|0
A2|11|0|1|0|0|0
A2|20|0|0|0|0|0
A1|26|1|0|1|0|0

Try it online at tio.run or at jqplay.org


To handle the case where Group may be null, the easiest way is to use a variation of peak's suggestion, e.g.:

  ["\(1+range(5))"] as $units
| ["Action", "Group", "Unit\($units[])"]
, ( inputs 
  | .Action as $a 
  | ( .Group // [{Id:"0", Units:[]}] )[]   # <-- supply default group if null
  | [$a, .Id, (.Units[$units[]|[.]] | if .!=[] then "1" else "0" end) ]
) | join("|")

Try it online at tio.run or jqplay.org

Upvotes: 3

peak

Reputation: 116730

This is for the case where the number of "Unit" columns (n) is unknown beforehand. It avoids reading in the entire file at once, and proceeds in three main steps: the relevant information is collected in a compact form by "synopsis"; n is computed; and the full rows are formed.

For simplicity, the following is for jq version 1.5 or later, and uses @csv. Small tweaks might be needed if jq 1.4 is used, depending on detailed requirements regarding the output.

Invocation

jq -nr -f tocsv.jq input.json

tocsv.jq:

# Input: a stream of JSON objects.
# Output: a stream of arrays.
def synopsis:
  inputs 
  | .Action as $a 
  | .Group[] 
  | [$a, .Id, (.Units|map(tonumber-1))];

# Input: an array of arrays
# Output: a stream of arrays suitable for @csv
def stream:
  def h(n): ["Action", "Group", "Unit\(range(1;n+1))"];
  def i(n): reduce .[] as $i ([range(0;n)|0]; .[$i]=1);
  (map(.[2] | max) | max + 1) as $n
  | h($n),
    (.[] | .[0:2] + (.[2] | i($n)))
  ;

[synopsis] | stream | @csv

Output

"Action","Group","Unit1","Unit2","Unit3"
"A1","10",1,0,0
"A2","11",0,1,0
"A2","20",0,0,0
"A1","26",1,0,1

Update

The easiest way to handle the "Group":null cases is probably to add the following line immediately before | .Group[]:

| .Group |= (. // [{Id:"0", Units:[]}])

That way you can also easily change the "default" value of "Id".

Upvotes: 2

peak

Reputation: 116730

This is for the case where the number of "Unit" columns (n) is known beforehand. It is just a variant of @jq170727's implementation.

The use of max is to ensure reasonable behavior if the given value of n is too small. In that case, the number of columns in the output will vary.

The following has been tested with jq versions 1.5 and master; see below for the tweaks necessary for earlier versions of jq.

Invocation: jq -nr -f tocsv.jq data.json

tocsv.jq:

# n is the number of desired "Unit" columns
def tocsv(n):
  def h: ["Action", "Group", "Unit\(range(1;n+1))"];
  def i(n): reduce .[] as $i ([range(0;n)|"0"]; .[$i]="1");
  def p:
    inputs 
    | .Action as $a 
    | .Group[] 
    | [$a, .Id] + (.Units | map(tonumber-1) | i(n));
  h,p | join(",") ;

tocsv(5)

The above has been written in a way that you can simply replace the call to join by a call to @csv or @tsv if you want all their benefits. In that case, though, you might want to use 0 and 1 rather than "0" and "1" in the indicator function, i.
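The indicator step on its own — turning a list of zero-based indices into a fixed-width vector of "0"/"1" strings, as i(n) does with reduce — can be sketched in Python:

```python
def indicator(indices, n):
    """Build an n-wide vector of "0"s and set each listed index to "1",
    mirroring: reduce .[] as $i ([range(0;n)|"0"]; .[$i]="1")"""
    vec = ["0"] * n
    for i in indices:
        vec[i] = "1"
    return vec

# Units ["1","3"] become zero-based indices [0, 2] via map(tonumber-1)
print(indicator([0, 2], 5))
```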

Verification

$ jq -nr -f tocsv.jq data.json
Action,Group,Unit1,Unit2,Unit3
A1,10,1,0,0
A2,11,0,1,0
A2,20,0,0,0
A1,26,1,0,1

jq 1.3 or jq 1.4

For jq 1.3 or 1.4, change inputs to .[], and use the following incantation:

 jq -r -s -f tocsv.jq data.json

Update

The easiest way to handle the "Group":null cases is probably to add the following line immediately before | .Group[]:

| .Group |= (. // [{Id:"0", Units:[]}])

That way you can also easily change the "default" value of "Id".

Upvotes: 2
