Reputation: 11
I would like to use IMDb's open data, but they serve it in TSV format, which is not very convenient.
https://datasets.imdbws.com/title.crew.tsv.gz
tconst directors writers
tt0000238 nm0349785 \N
tt0000239 nm0349785 \N
tt0000240 \N \N
tt0000241 nm0349785 \N
tt0000242 nm0617588 nm0617588
tt0000243 nm0349785 \N
tt0000244 nm0349785 \N
tt0000245 \N \N
tt0000246 nm0617588 \N
tt0000247 nm0002504,nm0005690,nm2156608 nm0000636,nm0002504
tt0000248 nm0808310 \N
tt0000249 nm0808310 \N
tt0000250 nm0005717 \N
tt0000251 nm0177862 \N
I want to convert the TSV data to JSON, like this:
[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": [
      "\\N"
    ]
  }
]
I can do this with the command:
jq -rRs 'split("\n")[1:-1]
  | map([split("\t")[] | split(",")]
      | { "tconst": .[0][0],
          "directors": .[1],
          "writers": .[2] })' ./title.crew.tsv > ./title.crew.json
However, the file is very large, and I get out-of-memory errors.
1. How can I split this TSV file into several JSON files, each with 1000 records?
./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json
2. How can I exclude empty fields, so that I get an empty array instead?
"writers": [ "\\N" ]
-> "writers": [ ]
UPD (the second question has been solved):
jq -rRs 'split("\n")[1:-1]
  | map([split("\t")[] | split(",")]
      | .[2] |= (if .[0] == "\\N" then [] else . end)
      | { "tconst": .[0][0],
          "directors": .[1],
          "writers": .[2] })' ./title.crew.tsv > ./title.crew.json
[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": []
  }
]
Thanks for your answers.
Upvotes: 1
Views: 738
Reputation: 116870
Since 1000 is a small number in the present context, here's a solution that does not use split; instead, it boils down to a single two-step pipeline.
The first part of the pipeline consists of an invocation of jq with the -c option (for converting the TSV into a stream of JSON arrays, one per chunk); this is described below. The second part of the pipeline converts this stream of arrays into the desired set of files, one array per file; this part can easily be implemented using awk or a similar tool of your choice (a sketch is given at the end of this answer).
# Assemble the items in the (possibly empty) stream into a
# (possibly empty) stream of arrays of length $n or less.
# $n can be any integer greater than 0;
# emit nothing if `stream` is empty.
def assemble(stream; $n):
  # box the input to detect eos
  foreach ((stream | [.]), null) as $item ({};
    (.array | length) as $l
    | if $item == null                  # eos
      then .emit = (0 < $l and $l < $n)
      else if $l == $n
           then .array = $item
           else .array += $item
           end
           | .emit = (.array | length == $n)
      end;
    if .emit then .array else empty end);
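# For example, assemble(1,2,3,4,5; 2) would emit [1,2], then [3,4], and finally [5].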
def stream:
  inputs
  | split("\t")
  | map_values(if . == "\\N" then "" else . end)
  | map(split(","))
  | { tconst: .[0][0],
      directors: .[1],
      writers: .[2] };
assemble(stream; 1000)
To skip the header, we omit the -n command-line option that would be used if there were no header:
jq -Rc -f program.jq input.tsv
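For completeness, here is a minimal sketch of the second part of the pipeline, assuming the page-numbered filenames from the question. Since -c puts each top-level array on a single line, awk can simply send the N-th line to the N-th file:
jq -Rc -f program.jq ./title.crew.tsv |
  awk '{ f = "title.crew.page" NR ".json"; print > f; close(f) }'
Calling close() after each write keeps the number of simultaneously open files at one, which matters when there are many chunks.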
Upvotes: 0
Reputation: 22032
If Python is an option, how about making use of it? Python's data structures map naturally onto JSON. Would you please try:
#!/usr/bin/python
import json

ary = []  # declare an empty array
with open('./title.crew.tsv') as f:
    header = f.readline().rstrip().split('\t')  # read the header line and split it
    for line in f:                              # iterate over the remaining lines
        body = line.rstrip().split('\t')
        d = {}                                  # empty dictionary
        for i in range(len(header)):
            if ',' in body[i]:                  # if the value contains ","
                b = body[i].split(',')          # then split the value on it
            else:
                b = body[i]
            if b == r'\N':                      # if the value is "\N" (raw string:
                b = []                          # '\N' alone is invalid in Python 3),
                                                # replace it with an empty array
            d[header[i]] = b                    # generate an object
        ary.append(d)                           # append the object to the array
print(json.dumps(ary, indent=2))
Output:
[
  {
    "directors": "nm0349785",
    "tconst": "tt0000238",
    "writers": []
  },
  {
    "directors": "nm0349785",
    "tconst": "tt0000239",
    "writers": []
  },
  {
    "directors": [],
    "tconst": "tt0000240",
    "writers": []
  },
<..SNIPPED..>
As Python is a general-purpose programming language, it offers a lot of flexibility for processing the input. It is also easy to split the result into multiple JSON files.
Upvotes: 0
Reputation: 116870
"they serve it in TSV format, which is not very convenient."
Actually, jq and TSV go extremely well together, and certainly using jq to process TSV files does not require using the -s ("slurp") option, which indeed is usually (but by no means always) best avoided.
If your goal were simply to produce a stream of the "tconst" objects, you could process the TSV file on a line-by-line basis; if you wanted to assemble that stream into a single array, you could use jq with the -c option to produce a stream with one JSON object per line, and then assemble them using a tool such as awk (i.e., simply adding the opening and closing brackets and the delimiting commas), as sketched below.
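For instance, here is a sketch of that line-by-line approach (the "\N"-to-empty-array normalization is omitted for brevity):
jq -Rc 'inputs
        | split("\t")
        | map(split(","))
        | { tconst: .[0][0], directors: .[1], writers: .[2] }' ./title.crew.tsv |
  awk 'BEGIN  { print "[" }
       NR > 1 { print "," }
              { print }
       END    { print "]" }'
Omitting the -n option here makes jq consume the header line as "." (which the program ignores), while inputs supplies the remaining lines.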
In your case, though, it would probably be simplest to split the TSV file first (e.g. using the unix/linux/mac split command -- see below) and then process each file along the lines of your jq program. Since your chunks are quite small (1000 objects each), you could even use jq with the -s option, but it's just as easy to use inputs and the -n command-line option instead:
jq -n '[inputs]'
Or you could combine these strategies: split into chunks, process each chunk using jq with the -c option to produce a stream, and assemble each such stream into a JSON array.
For splitting a file into chunks, see for example:
How to split a large text file into smaller files with equal number of lines?
Split text file into smaller multiple text file using command line
and many others.
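To make the split-then-process approach concrete, here is a minimal sketch (assuming the standard split utility with its default two-letter aa, ab, ... suffixes; the page. prefix is just illustrative):
# Drop the header, then split the body into 1000-line chunks: page.aa, page.ab, ...
tail -n +2 ./title.crew.tsv | split -l 1000 - page.

# Convert each chunk into a JSON array of objects
for f in page.??; do
  jq -Rn '[inputs
           | split("\t")
           | map(split(",") | if . == ["\\N"] then [] else . end)
           | { tconst: .[0][0], directors: .[1], writers: .[2] }]' \
    "$f" > "$f.json"
done
Renaming the outputs to the title.crew.pageN.json scheme from the question is then a simple loop with a counter.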
Upvotes: 1