Reputation: 5695
I want to parse a 10-20MB JSON file, and figure it's probably a good idea to not parse the entire JSON file at once and cause major memory usage. After looking around it seems like Oj's Saj or ScHandler APIs might be a good fit.
The only problem is that I can't really wrap my head around how to use them, and the documentation doesn't make it much clearer. I've looked at the example in Saj source code, and defined a super simple subclass of Oj::Saj like below:
class MySaj < Oj::Saj
def hash_start(key)
p key
end
end
Used like this:
open(URL) do |contents|
Oj.saj_parse(handler, contents)
end
And this leads to a lot of keys from my JSON being printed out. But I still have no idea how to actually access the values belonging to the keys I'm printing.
Can I access the hash itself somehow, or how am I supposed to do this?
Upvotes: 2
Views: 1186
Reputation: 459
Saj is a streaming parser. What that means, in practice, is that it doesn't know a file's contents in their entirety and parses them whole — it instead notifies you of parse events as it encounters them. Your thinking is solid: the larger the file, the more you benefit from parsing in that manner if you wish to pick and choose from it.
hash_start
is one such event, fired when Oj sees the beginning of an Object (which will become a Hash
in Ruby land).
Take this JSON for instance:
{
"student-1": {
"name": "John Doe",
"age": 42,
"knownAliases": ["Blabby Joe", "Stack Underflow"],
"trainingGrades": {
"Advanced Zumba Dancing": "A+",
"Introduction to Twitter Arguments": "C-"
}
},
"student-2": {
"name": "Rebecca Melecca",
"age": 26,
"knownAliases": ["Booger Becca", "Tanktop Terror"],
"trainingGrades": {
"Intermediate Groin Kickery": "A+",
"Advanced Quantum Mechanics": "A+"
}
}
And the following parser:
class StudentParser < Oj::Saj
def hash_start(key)
puts "hash_start(#{key.inspect})"
end
def hash_end(key)
puts "hash_end(#{key.inspect})"
end
def array_start(key)
puts "array_start(#{key.inspect})"
end
def array_end(key)
puts "array_end(#{key.inspect})"
end
def add_value(value, key)
puts "add_value(#{value.inspect}, #{key.inspect})"
end
end
And you'll get the following sequence of events:
hash_start(nil)
hash_start("student-1")
add_value("John Doe", "name")
add_value(42, "age")
array_start("knownAliases")
add_value("Blabby Joe", nil)
add_value("Stack Underflow", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Advanced Zumba Dancing")
add_value("C-", "Introduction to Twitter Arguments")
hash_end("trainingGrades")
hash_end("student-1")
hash_start("student-2")
add_value("Rebecca Melecca", "name")
add_value(26, "age")
array_start("knownAliases")
add_value("Booger Becca", nil)
add_value("Tanktop Terror", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Intermediate Groin Kickery")
add_value("A+", "Advanced Quantum Mechanics")
hash_end("trainingGrades")
hash_end("student-2")
hash_end(nil)
When you see hash_start(nil)
, it means the parser has found a top-level object (that very first opening brace). Conversely, hash_end(nil)
means that top-level object has been closed, and its innards properly parsed (i.e. no parsing erros have been found).
Parsing in this manner means you have to keep track of nesting, if that's meaningful to you, of adding keys and values at the right value, et cetera. That makes it annoying and hard, but worthwhile if you wish to carve out bits of a large file without committing everything to memory.
Upvotes: 3
Reputation: 8042
SAX-style parsing is complicated. You have to maintain the state of the parsing, and deal with each state change appropriately.
The hash_start
and array_start
callbacks, notify your SAX handler that Saj has found the beginning of a hash, and that the next callbacks that occur will be in the context of that hash. Note that hashes may be nested, contain (or be contained within) arrays, or simple values.
Here is a simple Saj handler that parses a very simple JSON object:
require 'oj'
class MySaj < ::Oj::Saj
def initialize()
@hash_cnt = 0
@array_cnt = 0
end
def hash_start(key)
@hash_cnt += 1
puts "Start-Hash[@hash_cnt]: '#{key}'"
end
def hash_end(key)
@hash_cnt -= 1
puts "End-Hash[@hash_cnt]: '#{key}'"
end
def array_start(key)
@array_cnt += 1
puts "Start-Array[@array_cnt]: '#{key}'"
end
def array_end(key)
@array_cnt -= 1
puts "End-Array[@array_cnt]: '#{key}'"
end
def add_value(value, key);
puts "Value: [#{key}] = '#{value}'"
end
def error(message, line, column)
puts "ERRRORRR: #{line}:#{column}: #{message}"
end
end
json = '[{ "key1": "abc", "key2": 123}, { "test1": "qwerty", "pi": 3.14159 }]'
cnt = MySaj.new()
Oj.saj_parse(cnt, json)
The results of this basic JSON parsing with Saj gives this result:
Start-Array[@array_cnt]: ''
Start-Hash[@hash_cnt]: ''
Value: [key1] = 'abc'
Value: [key2] = '123'
End-Hash[@hash_cnt]: ''
Start-Hash[@hash_cnt]: ''
Value: [test1] = 'qwerty'
Value: [pi] = '3.14159'
End-Hash[@hash_cnt]: ''
End-Array[@array_cnt]: ''
You may notice that this output is roughly equivalent to one callback per token (omitting ',' and ':'). You essentially have to build into your callbacks the knowledge of what to do with individual JSON elements. Along those lines, you also need to build the hierarchy described by the callbacks. For example, when hash_start
is called, push an empty hash on the stack; when hash_end
is called, pop the hash or move back one level in the hierarchy.
For example you could have a handler in hash_end
that checks to see if this is ending a top-level hash, and when it is, then do something with that hash. Note that you can often not do this with arrays, as the top-level element in a very large number of JSON documents is an array, so you have to determine when the array is the top+1 level array.
If you like writing compiler backends, this is the JSON parsing solution for you. Personally, I've never enjoyed working in Sax, but for large documents, it can be very resource-friendly and highly performant, depending on how well you write the handler. Be prepared for oodles of debugging and slightly mismatched state management, as that's par for the course with Sax-style parsing.
However, you shouldn't be too concerned with 10-20MB JSON, as that's actually not very large. I've processed 80+MB JSON with "regular" Oj (load
and dump
) quite a lot, and not had a problem with it. Unless you're running on a severely resource-constrained machine, the standard Oj will work well for you.
Upvotes: 4