Reputation: 449
I have a 19 GB JSON file: a huge array of rather small objects.
[{
"name":"Joe Blow",
"address":"Gotham, CA",
"log": [{},{},{}]
},
...
]
I want to iterate through the root array of this JSON. Each object, including its log, takes no more than 2 MB of memory, so it is possible to load one object into memory, work with it, and throw it away.
Yet the file itself is 19 GB and holds millions of those objects. I found that it is possible to iterate through such an array using C# and the Newtonsoft.Json library: you read the file as a stream and, as soon as you see a finished object, deserialize it and spit it out.
But can PowerShell do the same? That is, not read the whole thing as one chunk, but rather iterate over whatever is in the hopper right now.
Any ideas?
Upvotes: 3
Views: 2246
Reputation: 27516
As far as I know, ConvertFrom-Json doesn't have a streaming mode, but jq does; see "Processing huge json-array files with jq". The code below turns a giant array into just its elements, which can then be output piece by piece. Without streaming, even a 6 MB, 400,000-line JSON file can use 1 GB of memory after conversion (400 MB in PowerShell 7).
get-content file.json |
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
% { $_ | convertfrom-json }
So for example this:
[
{"name":"joe"},
{"name":"john"}
]
becomes this:
{"name":"joe"}
{"name":"john"}
The streaming format of jq looks very different from JSON. For example, the array becomes a sequence of events: a path paired with each value, plus end markers for each object and array.
'[{"name":"joe"},{"name":"john"}]' | jq --stream -c
[[0,"name"],"joe"]
[[0,"name"]] # end object
[[1,"name"],"john"]
[[1,"name"]] # end object
[[1]] # end array
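Then truncate_stream strips 1 leading element (the "parent folder", here the array index) from the path of each event, and drops events whose path is too short to strip, such as the end-of-array marker: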
'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream '1|truncate_stream(inputs)'
[["name"],"joe"]
[["name"]] # end object
[["name"],"john"]
[["name"]] # end object
# no more end array
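To make the event format concrete, here is a rough Python sketch that reproduces the two listings above. It is only an illustration of the event shapes, not how jq is implemented: it works on an already-parsed value, so it does not stream anything itself. The names to_stream and truncate_stream are borrowed from jq; the Python functions are hypothetical stand-ins.

```python
def to_stream(value, path=()):
    """Yield jq-style stream events for an already-parsed JSON value."""
    if isinstance(value, dict) and value:
        last_key = None
        for key, child in value.items():
            yield from to_stream(child, path + (key,))
            last_key = key
        # Closing event: the path of the last key, with no value attached.
        yield [list(path + (last_key,))]
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            yield from to_stream(child, path + (index,))
        # Closing event: the path of the last index, with no value attached.
        yield [list(path + (len(value) - 1,))]
    else:
        # Scalars and empty containers are leaves: [path, value].
        yield [list(path), value]


def truncate_stream(depth, events):
    """Strip `depth` leading path elements; drop events with shorter paths."""
    for event in events:
        path = event[0]
        if len(path) > depth:
            yield [path[depth:], *event[1:]]


events = list(to_stream([{"name": "joe"}, {"name": "john"}]))
for event in truncate_stream(1, events):
    print(event)
```

The [[1]] end-of-array event has a path of length 1, so it is dropped when truncating by 1, which is why the second listing has no end-of-array line.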
"fromstream()" turns it back into json...
'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
{"name":"joe"}
{"name":"john"}
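For comparison, the same "one element at a time" idea can be sketched outside jq and PowerShell. The Python below is a minimal illustration, not a tuned parser: it reads the file in chunks and uses json.JSONDecoder.raw_decode to pull one complete element off the front of a buffer. The function name iter_array_items and the chunk size are made up for the example, and it assumes the top-level value is an array whose elements are objects (or arrays), as in the question.

```python
import io
import json


def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    Assumes each element is an object or array, so a truncated element
    never parses successfully by accident.
    """
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        # Skip whitespace and the commas that separate elements.
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # Probably an incomplete element: read more and retry.
            chunk = fp.read(chunk_size)
            if not chunk:
                raise  # the input ended in the middle of an element
            buf += chunk
            continue
        yield item
        buf = buf[end:]  # discard the element that was just consumed


sample = io.StringIO('[\n{"name":"joe"},\n{"name":"john"}\n]')
for item in iter_array_items(sample, chunk_size=8):
    print(item["name"])
```

Only the current buffer is held in memory, so peak usage depends on the largest single element rather than on the file size. Retrying the parse after every chunk is wasteful for elements much larger than chunk_size, which a real implementation would want to avoid.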
Upvotes: 3