Reputation: 11
I would like to use IMDb's open data, but they serve it in TSV format, which is not very convenient.
https://datasets.imdbws.com/title.crew.tsv.gz
tconst directors writers
tt0000238 nm0349785 \N
tt0000239 nm0349785 \N
tt0000240 \N \N
tt0000241 nm0349785 \N
tt0000242 nm0617588 nm0617588
tt0000243 nm0349785 \N
tt0000244 nm0349785 \N
tt0000245 \N \N
tt0000246 nm0617588 \N
tt0000247 nm0002504,nm0005690,nm2156608 nm0000636,nm0002504
tt0000248 nm0808310 \N
tt0000249 nm0808310 \N
tt0000250 nm0005717 \N
tt0000251 nm0177862 \N
I want to convert the TSV data to JSON, like this:
[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": [
      "\\N"
    ]
  }
]
I can do this with the command:
jq -rRs 'split("\n")[1:-1]
  | map([split("\t")[] | split(",")]
      | { "tconst": .[0][0],
          "directors": .[1],
          "writers": .[2] })' ./title.crew.tsv > ./title.crew.json
However, the file is very large, and I get out-of-memory errors.
1. How can I split this TSV file into several JSON files, each with 1000 records?
./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json
2. How can I exclude empty fields, so that I get an empty array instead?
"writers": [ "\\N" ]
-> "writers": [ ]
UPD (the second question has been solved):
jq -rRs 'split("\n")[1:-1]
  | map([split("\t")[] | split(",")]
      | .[2] |= (if .[0] == "\\N" then [] else . end)
      | { "tconst": .[0][0],
          "directors": .[1],
          "writers": .[2] })' ./title.crew.tsv > ./title.crew.json
[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": []
  }
]
Thanks for your answers.
Upvotes: 1
Views: 738
Reputation: 116870
Since 1000 is a small number in the present context, here's a solution that does not use split; instead, it boils down to a single two-step pipeline.
The first part of the pipeline consists of an invocation of jq with the -c option (for converting the TSV into a stream of JSON arrays, one per chunk); this is described below. The second part of the pipeline converts this stream of arrays into the desired set of files, one array per file; this part can easily be implemented using awk or a similar tool of your choice (a sketch is given at the end of this answer).
# Assemble the items in the (possibly empty) stream into a
# (possibly empty) stream of arrays of length $n or less.
# $n can be any integer greater than 0;
# emit nothing if `stream` is empty.
def assemble(stream; $n):
  # box the input to detect eos
  foreach ((stream | [.]), null) as $item ({};
    (.array | length) as $l
    | if $item == null                  # eos
      then .emit = (0 < $l and $l < $n)
      else if $l == $n
           then .array = $item
           else .array += $item
           end
           | .emit = (.array | length == $n)
      end;
    if .emit then .array else empty end);
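# For example, assemble(1,2,3,4,5; 2) would emit [1,2], then [3,4], and finally [5].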
def stream:
  inputs
  | split("\t")
  | map_values(if . == "\\N" then "" else . end)
  | map(split(","))
  | { tconst: .[0][0],
      directors: .[1],
      writers: .[2] };
assemble(stream; 1000)
To skip the header, we omit the -n command-line option that would be used if there were no header:
jq -Rc -f program.jq input.tsv
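For completeness, here is a minimal sketch of the second part of the pipeline, assuming the page-numbered filenames from the question. Since -c puts each top-level array on a single line, awk can simply send the N-th line to the N-th file:
jq -Rc -f program.jq ./title.crew.tsv |
  awk '{ f = "title.crew.page" NR ".json"; print > f; close(f) }'
Calling close() after each write keeps the number of simultaneously open files at one, which matters when there are many chunks.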
Upvotes: 0
Reputation: 22032
If Python is an option, how about making use of it? Python's data structures map naturally onto JSON. Would you please try:
#!/usr/bin/python
import json

ary = []  # declare an empty array
with open('./title.crew.tsv') as f:
    header = f.readline().rstrip().split('\t')  # read the header line and split it
    for line in f:                              # iterate over the remaining lines
        body = line.rstrip().split('\t')
        d = {}                                  # empty dictionary
        for i in range(len(header)):
            if ',' in body[i]:                  # if the value contains ","
                b = body[i].split(',')          # then split the value on it
            else:
                b = body[i]
            if b == r'\N':                      # if the value is "\N" (raw string:
                b = []                          # '\N' alone is invalid in Python 3),
                                                # replace it with an empty array
            d[header[i]] = b                    # generate an object
        ary.append(d)                           # append the object to the array
print(json.dumps(ary, indent=2))
Output:
[
  {
    "directors": "nm0349785",
    "tconst": "tt0000238",
    "writers": []
  },
  {
    "directors": "nm0349785",
    "tconst": "tt0000239",
    "writers": []
  },
  {
    "directors": [],
    "tconst": "tt0000240",
    "writers": []
  },
<..SNIPPED..>
As Python is a general-purpose programming language, it offers a lot of flexibility for processing the input. It is also easy to split the result into multiple JSON files.
Upvotes: 0
Reputation: 116870
"they serve it in TSV format, which is not very convenient."
Actually, jq and TSV go extremely well together, and certainly using jq to process TSV files does not require using the -s ("slurp") option, which indeed is usually (but by no means always) best avoided.
If your goal were simply to produce a stream of the "tconst" objects, you could process the TSV file on a line-by-line basis; if you wanted to assemble that stream into a single array, you could use jq with the -c option to produce a stream with one JSON object per line, and then assemble them using a tool such as awk (i.e., simply adding the opening and closing brackets and the delimiting commas), as sketched below.
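For instance, here is a sketch of that line-by-line approach (the "\N"-to-empty-array normalization is omitted for brevity):
jq -Rc 'inputs
        | split("\t")
        | map(split(","))
        | { tconst: .[0][0], directors: .[1], writers: .[2] }' ./title.crew.tsv |
  awk 'BEGIN  { print "[" }
       NR > 1 { print "," }
              { print }
       END    { print "]" }'
Omitting the -n option here makes jq consume the header line as "." (which the program ignores), while inputs supplies the remaining lines.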
In your case, though, it would probably be simplest to split the TSV file first (e.g. using the unix/linux/mac split command -- see below) and then process each file along the lines of your jq program. Since your chunks are quite small (1000 objects each), you could even use jq with the -s option, but it's just as easy to use inputs and the -n command-line option instead:
jq -n '[inputs]'
Or you could combine these strategies: split into chunks, process each chunk using jq with the -c option to produce a stream, and assemble each such stream into a JSON array.
For splitting a file into chunks, see for example:
How to split a large text file into smaller files with equal number of lines?
Split text file into smaller multiple text file using command line
and many others.
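To make the split-then-process approach concrete, here is a minimal sketch (assuming the standard split utility with its default two-letter aa, ab, ... suffixes; the page. prefix is just illustrative):
# Drop the header, then split the body into 1000-line chunks: page.aa, page.ab, ...
tail -n +2 ./title.crew.tsv | split -l 1000 - page.

# Convert each chunk into a JSON array of objects
for f in page.??; do
  jq -Rn '[inputs
           | split("\t")
           | map(split(",") | if . == ["\\N"] then [] else . end)
           | { tconst: .[0][0], directors: .[1], writers: .[2] }]' \
    "$f" > "$f.json"
done
Renaming the outputs to the title.crew.pageN.json scheme from the question is then a simple loop with a counter.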
Upvotes: 1