Reputation: 31061
I'm trying to parse a GitHub archive file with yajl-py. I believe the basic format of the file is a stream of JSON objects, so the file itself is not valid JSON, but it contains objects which are.
To test this out, I installed yajl-py and then used their example parser (from https://github.com/pykler/yajl-py/blob/master/examples/yajl_py_example.py) to try to parse a file:
python yajl_py_example.py < 2012-03-12-0.json
where 2012-03-12-0.json is one of the GitHub archive files that's been decompressed.
It appears this sort of thing should work from their reference implementation in Ruby. Do the Python packages not handle JSON streams?
By the way, here's the error I get:
yajl.yajl_common.YajlError: parse error: trailing garbage
9478bbc3","type":"PushEvent"}{"repository":{"url":"https://g
(right here) ------^
Upvotes: 5
Views: 1942
Reputation: 199
I know this has been answered, but I prefer the following approach, which does not use any third-party packages. The GitHub archive puts all of its dictionaries on a single line for some reason, so you cannot assume one dictionary per line. The data looks like:
{"json-key":"json-val", "sub-dict":{"sub-key":"sub-val"}}{"json-key2":"json-val2", "sub-dict2":{"sub-key2":"sub-val2"}}
I decided to create a function which fetches one dictionary at a time. It returns json as a string.
def read_next_dictionary(f):
    depth = 0
    json_str = ""
    while True:
        c = f.read(1)
        if not c:
            break #EOF
        json_str += str(c)
        if c == '{':
            depth += 1
        elif c == '}':
            depth -= 1
        if depth == 0:
            break
    return json_str
I used this function to loop through the GitHub archive with a while loop:
import json
import pprint

arr_of_dicts = []
with open(file_path) as f:
    while True:
        json_as_str = read_next_dictionary(f)
        try:
            json_dict = json.loads(json_as_str)
        except ValueError:
            break # an empty or invalid chunk (e.g. at EOF) ends the loop
        arr_of_dicts.append(json_dict)
pprint.pprint(arr_of_dicts)
This works on the dataset posted here: http://www.githubarchive.org/ (after gunzip)
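One caveat with counting braces character by character is that a `{` or `}` inside a string value changes the depth count. As a rough sketch of an alternative that still uses only the standard library, `json.JSONDecoder.raw_decode` can parse one value and report where it ended, so the next value can be parsed from that position. The function name `iter_concatenated_json` and the sample data are made up for illustration, and this version assumes the whole file fits in memory as a single string.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: two objects with no separator, one containing a brace
# inside a string value.
sample = '{"type":"PushEvent","msg":"a } brace"}{"type":"WatchEvent"}\n'
events = list(iter_concatenated_json(sample))
```

To use it on an archive file, read the decompressed file into a string (for example with `open(file_path).read()`) and iterate over `iter_concatenated_json(...)`.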
Upvotes: 1
Reputation: 898
As a workaround you can split the GitHub Archive files into lines and then parse each line as json:
import json
with open('2013-05-31-10.json') as f:
    lines = f.read().splitlines()
for line in lines:
    rec = json.loads(line)
    ...
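If the file really does hold one JSON object per line (which is an assumption; other answers in this thread report files where every object sits on a single line), the same idea can be sketched lazily so the whole file is never held in memory, and without a separate gunzip step. The helper name `iter_json_lines` is hypothetical.

```python
import gzip
import json


def iter_json_lines(path):
    """Yield one parsed record per non-empty line of a plain or gzipped file."""
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Usage would look like `for rec in iter_json_lines('2013-05-31-10.json.gz'): ...`, with each `rec` being a dictionary for one event.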
Upvotes: -1
Reputation: 14865
The example does not enable any of Yajl's extra features. For what you are looking for, you need to enable the allow_multiple_values flag on the parser. Here is what you need to change in the basic example to have it parse your file:
--- a/examples/yajl_py_example.py
+++ b/examples/yajl_py_example.py
@@ -37,6 +37,7 @@ class ContentHandler(YajlContentHandler):
 def main(args):
     parser = YajlParser(ContentHandler())
+    parser.allow_multiple_values = True
     if args:
         for fn in args:
             f = open(fn)
Yajl-Py is a thin wrapper around yajl, so you can use all the features Yajl provides. Here are all the flags that yajl provides that you can enable:
yajl_allow_comments
yajl_dont_validate_strings
yajl_allow_trailing_garbage
yajl_allow_multiple_values
yajl_allow_partial_values
To turn these on in yajl-py you do the following:
parser = YajlParser(ContentHandler())
# enabling these features, note that to make it more pythonic, the prefix `yajl_` was removed
parser.allow_comments = True
parser.dont_validate_strings = True
parser.allow_trailing_garbage = True
parser.allow_multiple_values = True
parser.allow_partial_values = True
# then go ahead and parse
parser.parse()
Upvotes: 1
Reputation: 9611
You need to use a stream parser to read the data. Yajl supports stream parsing, which allows you to read one object at a time from a file or stream. Having said that, it doesn't look like Python has working bindings for Yajl.
py-yajl has iterload commented out, not sure why: https://github.com/rtyler/py-yajl/commit/a618f66005e9798af848c15d9aa35c60331e6687#L1R264
Not a Python solution, but you can use Ruby bindings to read in the data and emit it in a format you need:
# gem install yajl-ruby
require 'open-uri'
require 'zlib'
require 'yajl'

gz = open('http://data.githubarchive.org/2012-03-11-12.json.gz')
js = Zlib::GzipReader.new(gz).read

Yajl::Parser.parse(js) do |event|
  print event
end
Upvotes: 4