How to parse JSON with the Oj SAX parser, Saj

Question

I want to parse a 10-20MB JSON file, and figure it's probably a good idea to not parse the entire JSON file at once and cause major memory usage. After looking around it seems like Oj's Saj or ScHandler APIs might be a good fit.

The only problem is that I can't really wrap my head around how to use them, and the documentation doesn't make it much clearer. I've looked at the example in Saj source code, and defined a super simple subclass of Oj::Saj like below:

class MySaj < Oj::Saj
  def hash_start(key)
    p key
  end
end

Used like this:

require 'open-uri'

handler = MySaj.new
URI.open(URL) do |contents|
  Oj.saj_parse(handler, contents)
end

And this leads to a lot of keys from my JSON being printed out. But I still have no idea how to actually access the values belonging to the keys I'm printing.

Can I access the hash itself somehow, or how am I supposed to do this?

Michael Gaskill · Accepted Answer

SAX-style parsing is complicated. You have to maintain the state of the parsing, and deal with each state change appropriately.

The hash_start and array_start callbacks notify your handler that Saj has found the beginning of a hash or an array, and that the callbacks that follow occur in the context of that container until the matching hash_end or array_end arrives. Note that hashes may be nested, and may contain (or be contained within) arrays or simple values.

Here is a simple Saj handler that parses a very simple JSON document (an array of two objects):

require 'oj'

class MySaj < ::Oj::Saj
  def initialize
    @hash_cnt = 0   # current hash nesting depth
    @array_cnt = 0  # current array nesting depth
  end

  # key is nil for the top-level value and for elements of an array
  def hash_start(key)
    @hash_cnt += 1
    puts "Start-Hash[#{@hash_cnt}]: '#{key}'"
  end

  def hash_end(key)
    # Print before decrementing so the depth matches the Start-Hash line
    puts "End-Hash[#{@hash_cnt}]: '#{key}'"
    @hash_cnt -= 1
  end

  def array_start(key)
    @array_cnt += 1
    puts "Start-Array[#{@array_cnt}]: '#{key}'"
  end

  def array_end(key)
    # Print before decrementing so the depth matches the Start-Array line
    puts "End-Array[#{@array_cnt}]: '#{key}'"
    @array_cnt -= 1
  end

  def add_value(value, key)
    puts "Value: [#{key}] = '#{value}'"
  end

  def error(message, line, column)
    puts "Error: #{line}:#{column}: #{message}"
  end
end

json = '[{ "key1": "abc", "key2": 123}, { "test1": "qwerty", "pi": 3.14159 }]'

handler = MySaj.new
Oj.saj_parse(handler, json)

Parsing this JSON with the handler above prints:

Start-Array[1]: ''
Start-Hash[1]: ''
Value: [key1] = 'abc'
Value: [key2] = '123'
End-Hash[1]: ''
Start-Hash[1]: ''
Value: [test1] = 'qwerty'
Value: [pi] = '3.14159'
End-Hash[1]: ''
End-Array[1]: ''

You may notice that this output is roughly one callback per token (omitting ',' and ':', and with each key delivered together with its value). You essentially have to build into your callbacks the knowledge of what to do with individual JSON elements. Along those lines, you also need to rebuild the hierarchy that the callbacks describe. For example, when hash_start is called, push an empty hash onto a stack; when hash_end is called, pop that hash to move back up one level in the hierarchy.

For example, hash_end could check whether it is closing a top-level hash and, if so, do something with that hash. You often cannot apply the same check to arrays, because the top-level element of a great many JSON documents is itself an array. In that case the elements you care about sit one level below the top, so you have to detect when a top+1 level container closes.

If you like writing compiler backends, this is the JSON parsing solution for you. Personally, I've never enjoyed working with SAX, but for large documents it can be very resource-friendly and highly performant, depending on how well you write the handler. Be prepared for oodles of debugging and slightly mismatched state management, as that's par for the course with SAX-style parsing.

However, you shouldn't be too concerned with 10-20MB JSON, as that's actually not very large. I've processed 80+MB JSON with "regular" Oj (load and dump) quite a lot, and not had a problem with it. Unless you're running on a severely resource-constrained machine, the standard Oj will work well for you.
