Reputation: 329
I am trying to parse a JSON object that consists of a few hashes and one massive array of hashes (sometimes 300,000 hashes inside the array, around 200MB). I need to process the array report_datasets hash by hash. Here is an example of such a JSON object:
https://api.datacite.org/reports/0cb326d1-e3e7-4cc1-9d86-7c5f3d5ca310
{ report_header: { report_id: 33738, report_name: "Report first" },
  report_datasets: [
    {dataset_id: 1, yop: 1990},
    {dataset_id: 2, yop: 2007},
    {dataset_id: 3, yop: 1983},
    ...
    {dataset_id: 578999, yop: 1964}
  ]
}
In every approach I have tried, including a few using yajl-ruby and json-streamer, my app is killed. For example, when I use parse_chunk:
require 'yajl'

def parse_very_large_json
  options = { symbolize_keys: false }
  parser = Yajl::Parser.new(options)
  parser.on_parse_complete = method(:print_each_item)
  # json_string holds the complete raw JSON of the report
  report_array = parser.parse_chunk(json_string)
end

def print_each_item(report)
  report["report-datasets"].each do |dataset|
    puts "this is an element of the array"
    puts dataset
  end
end
Parsing does happen, but eventually the process is killed again. The problem seems to be that, in practice, there is not much difference between Yajl::Parser.new().parse and Yajl::Parser.new().parse_chunk: both approaches end up killed.
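If I understand it correctly, parse_chunk only lets me feed the document to the parser in pieces (for example from a file or a streamed HTTP response) instead of holding one giant string; on_parse_complete is still handed the fully built hash, so peak memory stays about the same. A minimal sketch of what I mean (the file name, chunk size and lambda callback are just placeholders):

require 'yajl'

parser = Yajl::Parser.new
# Still called exactly once, with the complete report materialized as a Ruby hash
parser.on_parse_complete = ->(report) { puts report["report-datasets"].length }

File.open("report.json", "rb") do |io|
  # Feed the JSON piece by piece; this avoids keeping the raw 200MB string,
  # but the parsed hash (and its huge array) is still built in full
  parser.parse_chunk(io.read(8192)) until io.eof?
end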
How can one parse the elements of such a massive JSON array efficiently without killing the Rails app?
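What I imagine I need instead is an event-driven (SAX-style) parser that hands me each dataset hash as soon as it has been parsed, without ever materializing the whole report_datasets array. Roughly something like the sketch below, written against Oj's ScHandler purely as an illustration (Oj is an assumption on my part, not something my current code uses, and the "dataset_id" key just follows the example above):

require 'oj'

# Builds each small dataset hash, but never keeps the 300,000-element array
class DatasetHandler < Oj::ScHandler
  def initialize(&block)
    @block = block
  end

  def hash_start
    {}
  end

  def hash_set(h, key, value)
    h[key] = value
  end

  def array_start
    []
  end

  def array_append(a, value)
    if value.is_a?(Hash) && value.key?("dataset_id")
      @block.call(value)  # hand over each dataset as soon as it is complete
    else
      a << value          # keep everything that is not a dataset
    end
  end
end

handler = DatasetHandler.new { |dataset| puts dataset["yop"] }
File.open("report.json", "rb") { |io| Oj.sc_parse(handler, io) }

Is something like that the right direction here, or is there a way to get the same effect out of yajl-ruby or json-streamer?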
Upvotes: 3
Views: 463