kriztean

Reputation: 329

Parse a massive JSON array of hashes

I am trying to parse a JSON object that consists of a few hashes and a massive array of hashes (sometimes 300,000 hashes inside the array, 200MB). Here is an example of the JSON object. I need to parse hash by hash inside the array report_datasets.

https://api.datacite.org/reports/0cb326d1-e3e7-4cc1-9d86-7c5f3d5ca310

{ "report_header": { "report_id": 33738, "report_name": "Report first" },
  "report_datasets": [
    { "dataset_id": 1, "yop": 1990 },
    { "dataset_id": 2, "yop": 2007 },
    { "dataset_id": 3, "yop": 1983 },
    ...
    { "dataset_id": 578999, "yop": 1964 }
  ]
}

Every approach I have tried, including several with yajl-ruby and json-streamer, gets my app killed. When I use parse_chunk,

def parse_very_large_json
  options = { symbolize_keys: false }
  parser = Yajl::Parser.new(options)
  parser.on_parse_complete = method(:print_each_item)

  parser.parse_chunk(json_string)
end

def print_each_item(report)
  report["report_datasets"].each do |dataset|
    puts "this is an element of the array"
    puts dataset
  end
end

parsing happens, but eventually the process is killed again.

The problem seems to be that there is little practical difference between Yajl::Parser.new().parse and Yajl::Parser.new().parse_chunk here: both approaches end up killed.

How can one parse the elements of such a massive JSON array efficiently without the Rails app being killed?
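To illustrate the element-by-element behavior I am after, here is a stdlib-only sketch (a hand-rolled scanner used purely as an illustration; the helper name and approach are my own, not part of yajl-ruby or json-streamer). It scans the document character by character, buffers each balanced { ... } object found inside the first top-level array, and parses the elements one at a time, so only a single hash is in memory at once:

```ruby
require "json"
require "stringio"

# Hypothetical helper: yield each object element of the first
# top-level JSON array in `io`, parsed individually.
def each_array_element(io)
  in_array  = false  # inside the big array yet?
  in_string = false  # inside a JSON string?
  escaped   = false  # previous char was a backslash inside a string
  depth     = 0      # brace depth within the current element
  buffer    = +""

  io.each_char do |ch|
    if in_string
      buffer << ch if depth > 0
      if escaped       then escaped = false
      elsif ch == "\\" then escaped = true
      elsif ch == '"'  then in_string = false
      end
      next
    end

    case ch
    when '"'
      in_string = true
      buffer << ch if depth > 0
    when "{"
      depth += 1 if in_array
      buffer << ch if depth > 0
    when "}"
      next unless depth > 0
      buffer << ch
      depth -= 1
      if depth.zero?
        yield JSON.parse(buffer)  # one small hash at a time
        buffer = +""
      end
    when "[", "]"
      if depth > 0
        buffer << ch
      elsif ch == "["
        in_array = true
      elsif in_array
        break  # closing "]" of the big array: done
      end
    else
      buffer << ch if depth > 0
    end
  end
end
```

With something like `each_array_element(File.open("report.json")) { |dataset| puts dataset }`, the file would be visited hash by hash without the whole array ever being materialized.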

Upvotes: 3

Views: 463

Answers (0)
