Reputation: 1680
I am writing a web service which returns objects containing very long lists, encoded in JSON. Of course we want to use iterators rather than Python lists so we can stream the objects from a database; unfortunately, the JSON encoder in the standard library (json.JSONEncoder) only accepts lists and tuples to be converted to JSON lists (though _iterencode_list looks like it would actually work on any iterable).
The docstrings suggest overriding default to convert the object to a list, but this means we lose the benefits of streaming. Previously, we overrode a private method, but (as could have been expected) that broke when the encoder was refactored.
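For illustration, the documented default()-based approach looks roughly like this (ListingEncoder is a hypothetical name); it produces correct output but pulls the whole iterator into memory first:
import json

class ListingEncoder(json.JSONEncoder):
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            return super().default(o)
        return list(iterable)  # materializes the entire result set up front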
What is the best way to serialize iterators as JSON lists in Python in a streaming way?
Upvotes: 19
Views: 5045
Reputation: 9980
I needed exactly this. My first approach was to override the JSONEncoder.iterencode() method. However, this does not work: as soon as the iterator is not the top-level object, the internals of json.encoder's _iterencode() function take over.
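For reference, a rough sketch of what that first attempt might look like (hypothetical code): it streams a top-level iterator just fine, but a generator nested inside a dict or list never reaches this method and still raises TypeError:
import collections.abc
import json

class NaiveIterEncoder(json.JSONEncoder):
    def iterencode(self, o, _one_shot=False):
        if isinstance(o, collections.abc.Iterator):
            # hand-roll the array syntax around the top-level iterator
            yield "["
            for i, item in enumerate(o):
                if i:
                    yield ", "
                yield from super().iterencode(item, _one_shot)
            yield "]"
        else:
            yield from super().iterencode(o, _one_shot)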
After some studying of the code, I found a very hacky solution, but it works. Python 3 only, but I'm sure the same magic is possible in Python 2 (just with other magic-method names):
import collections.abc
import json
import itertools
import sys
import resource
import time
starttime = time.time()
lasttime = None
def log_memory():
if "linux" in sys.platform.lower():
to_MB = 1024
else:
to_MB = 1024 * 1024
print("Memory: %.1f MB, time since start: %.1f sec%s" % (
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / to_MB,
time.time() - starttime,
"; since last call: %.1f sec" % (time.time() - lasttime) if lasttime
else "",
))
globals()["lasttime"] = time.time()
class IterEncoder(json.JSONEncoder):
"""
JSON Encoder that encodes iterators as well.
Write directly to file to use minimal memory
"""
class FakeListIterator(list):
def __init__(self, iterable):
self.iterable = iter(iterable)
try:
self.firstitem = next(self.iterable)
self.truthy = True
except StopIteration:
self.truthy = False
def __iter__(self):
if not self.truthy:
return iter([])
return itertools.chain([self.firstitem], self.iterable)
def __len__(self):
raise NotImplementedError("Fakelist has no length")
def __getitem__(self, i):
raise NotImplementedError("Fakelist has no getitem")
        def __setitem__(self, i, value):
            raise NotImplementedError("Fakelist has no setitem")
def __bool__(self):
return self.truthy
def default(self, o):
if isinstance(o, collections.abc.Iterable):
return type(self).FakeListIterator(o)
return super().default(o)
print(json.dumps((i for i in range(10)), cls=IterEncoder))
print(json.dumps((i for i in range(0)), cls=IterEncoder))
print(json.dumps({"a": (i for i in range(10))}, cls=IterEncoder))
print(json.dumps({"a": (i for i in range(0))}, cls=IterEncoder))
log_memory()
print("dumping 10M numbers as incrementally")
with open("/dev/null", "wt") as fp:
json.dump(range(10000000), fp, cls=IterEncoder)
log_memory()
print("dumping 10M numbers built in encoder")
with open("/dev/null", "wt") as fp:
json.dump(list(range(10000000)), fp)
log_memory()
Results:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
{"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
{"a": []}
Memory: 8.4 MB, time since start: 0.0 sec
dumping 10M numbers as incrementally
Memory: 9.0 MB, time since start: 8.6 sec; since last call: 8.6 sec
dumping 10M numbers built in encoder
Memory: 395.5 MB, time since start: 17.1 sec; since last call: 8.5 sec
It's clear to see that the IterEncoder does not need the memory to store the 10M ints, while keeping the same encoding speed.
The (hacky) trick is that _iterencode_list doesn't actually need any of the list machinery. It just wants to know whether the list is empty (__bool__) and then get its iterator. However, it is only reached when isinstance(x, (list, tuple)) returns True. So I'm packaging the iterator into a list subclass, disabling all the random access, getting the first element up front so that I know whether it's empty or not, and feeding back the iterator. Then the default method returns this fake list in case of an iterator.
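Usage then just means passing the encoder class to json.dump(); for example (rows_from_db is a hypothetical generator standing in for a database cursor):
def rows_from_db():
    # placeholder for e.g. a server-side database cursor
    for i in range(1000000):
        yield {"id": i, "name": "row %d" % i}

with open("listing.json", "w") as fp:
    # the nested generator is streamed, never materialized as a list
    json.dump({"results": rows_from_db()}, fp, cls=IterEncoder)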
Upvotes: 9
Reputation: 11322
Real streaming is not well supported by json, as it would also mean that the client application has to support streaming too. There are some Java libraries that support reading a streamed JSON document, but they are not very generic. There are also Python bindings for yajl, which is a C library that supports streaming.
Maybe you can use YAML instead of JSON. YAML is a superset of JSON, it has better support for streaming on both sides, and any JSON message will still be valid YAML.
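With PyYAML, for example, yaml.safe_dump_all() already accepts any iterable of documents and emits them one at a time; a rough sketch (the output is a multi-document YAML stream separated by ---, which the client has to be prepared to read):
import yaml

def records():
    # hypothetical stand-in for a database cursor
    for i in range(3):
        yield {"id": i}

with open("listing.yaml", "w") as fp:
    yaml.safe_dump_all(records(), fp)  # one YAML document per record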
But in your case it may be much simpler to split your object stream into a stream of separate JSON messages.
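For example, writing one complete JSON document per line (the "JSON lines" convention) keeps both sides simple; a minimal sketch:
import json

def write_json_lines(objects, fp):
    # each object becomes one self-contained JSON document on its own line
    for obj in objects:
        fp.write(json.dumps(obj))
        fp.write("\n")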
See also this discussion of which client libraries support streaming: Is there a streaming API for JSON?
Upvotes: -1
Reputation: 4481
Save this into a module file and import it or paste it directly into your code.
'''
Copied from Python 2.7.8 json.encoder lib, diff follows:
@@ -331,6 +331,8 @@
chunks = _iterencode(value, _current_indent_level)
for chunk in chunks:
yield chunk
+ if first:
+ yield buf
if newline_indent is not None:
_current_indent_level -= 1
yield '\n' + (' ' * (_indent * _current_indent_level))
@@ -427,12 +429,12 @@
yield str(o)
elif isinstance(o, float):
yield _floatstr(o)
- elif isinstance(o, (list, tuple)):
- for chunk in _iterencode_list(o, _current_indent_level):
- yield chunk
elif isinstance(o, dict):
for chunk in _iterencode_dict(o, _current_indent_level):
yield chunk
+ elif hasattr(o, '__iter__'):
+ for chunk in _iterencode_list(o, _current_indent_level):
+ yield chunk
else:
if markers is not None:
markerid = id(o)
'''
from json import encoder
def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
_key_separator, _item_separator, _sort_keys, _skipkeys, _one_shot,
## HACK: hand-optimized bytecode; turn globals into locals
ValueError=ValueError,
basestring=basestring,
dict=dict,
float=float,
id=id,
int=int,
isinstance=isinstance,
list=list,
long=long,
str=str,
tuple=tuple,
):
def _iterencode_list(lst, _current_indent_level):
if not lst:
yield '[]'
return
if markers is not None:
markerid = id(lst)
if markerid in markers:
raise ValueError("Circular reference detected")
markers[markerid] = lst
buf = '['
if _indent is not None:
_current_indent_level += 1
newline_indent = '\n' + (' ' * (_indent * _current_indent_level))
separator = _item_separator + newline_indent
buf += newline_indent
else:
newline_indent = None
separator = _item_separator
first = True
for value in lst:
if first:
first = False
else:
buf = separator
if isinstance(value, basestring):
yield buf + _encoder(value)
elif value is None:
yield buf + 'null'
elif value is True:
yield buf + 'true'
elif value is False:
yield buf + 'false'
elif isinstance(value, (int, long)):
yield buf + str(value)
elif isinstance(value, float):
yield buf + _floatstr(value)
else:
yield buf
if isinstance(value, (list, tuple)):
chunks = _iterencode_list(value, _current_indent_level)
elif isinstance(value, dict):
chunks = _iterencode_dict(value, _current_indent_level)
else:
chunks = _iterencode(value, _current_indent_level)
for chunk in chunks:
yield chunk
if first:
yield buf
if newline_indent is not None:
_current_indent_level -= 1
yield '\n' + (' ' * (_indent * _current_indent_level))
yield ']'
if markers is not None:
del markers[markerid]
def _iterencode_dict(dct, _current_indent_level):
if not dct:
yield '{}'
return
if markers is not None:
markerid = id(dct)
if markerid in markers:
raise ValueError("Circular reference detected")
markers[markerid] = dct
yield '{'
if _indent is not None:
_current_indent_level += 1
newline_indent = '\n' + (' ' * (_indent * _current_indent_level))
item_separator = _item_separator + newline_indent
yield newline_indent
else:
newline_indent = None
item_separator = _item_separator
first = True
if _sort_keys:
items = sorted(dct.items(), key=lambda kv: kv[0])
else:
items = dct.iteritems()
for key, value in items:
if isinstance(key, basestring):
pass
# JavaScript is weakly typed for these, so it makes sense to
# also allow them. Many encoders seem to do something like this.
elif isinstance(key, float):
key = _floatstr(key)
elif key is True:
key = 'true'
elif key is False:
key = 'false'
elif key is None:
key = 'null'
elif isinstance(key, (int, long)):
key = str(key)
elif _skipkeys:
continue
else:
raise TypeError("key " + repr(key) + " is not a string")
if first:
first = False
else:
yield item_separator
yield _encoder(key)
yield _key_separator
if isinstance(value, basestring):
yield _encoder(value)
elif value is None:
yield 'null'
elif value is True:
yield 'true'
elif value is False:
yield 'false'
elif isinstance(value, (int, long)):
yield str(value)
elif isinstance(value, float):
yield _floatstr(value)
else:
if isinstance(value, (list, tuple)):
chunks = _iterencode_list(value, _current_indent_level)
elif isinstance(value, dict):
chunks = _iterencode_dict(value, _current_indent_level)
else:
chunks = _iterencode(value, _current_indent_level)
for chunk in chunks:
yield chunk
if newline_indent is not None:
_current_indent_level -= 1
yield '\n' + (' ' * (_indent * _current_indent_level))
yield '}'
if markers is not None:
del markers[markerid]
def _iterencode(o, _current_indent_level):
if isinstance(o, basestring):
yield _encoder(o)
elif o is None:
yield 'null'
elif o is True:
yield 'true'
elif o is False:
yield 'false'
elif isinstance(o, (int, long)):
yield str(o)
elif isinstance(o, float):
yield _floatstr(o)
elif isinstance(o, dict):
for chunk in _iterencode_dict(o, _current_indent_level):
yield chunk
elif hasattr(o, '__iter__'):
for chunk in _iterencode_list(o, _current_indent_level):
yield chunk
else:
if markers is not None:
markerid = id(o)
if markerid in markers:
raise ValueError("Circular reference detected")
markers[markerid] = o
o = _default(o)
for chunk in _iterencode(o, _current_indent_level):
yield chunk
if markers is not None:
del markers[markerid]
return _iterencode
encoder._make_iterencode = _make_iterencode
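After importing the module (assumed here to be saved as iterencode_patch.py), the pure-Python encoding path accepts any object with __iter__; a minimal usage sketch (Python 2):
import json
import iterencode_patch  # hypothetical module name; importing it applies the patch

def rows():
    # stand-in for a database cursor
    for i in xrange(1000000):
        yield {"id": i}

with open("listing.json", "w") as fp:
    # json.dump() goes through the (patched) pure-Python iterencode path,
    # so the generator is streamed chunk by chunk
    json.dump(rows(), fp)
Note that json.dumps() may still take the C-accelerated one-shot path and reject the generator, so write to a file-like object with json.dump().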
Upvotes: 2
Reputation: 7450
Not that simple. The WSGI protocol (which is what most people use) does not support streaming, and the servers that do support it are violating the spec.
And even if you use a non-compliant server, you would have to use something like ijson on the client side. Also take a look at this post by someone who had the same problem as you: http://www.enricozini.org/2011/tips/python-stream-json/
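On the consuming side, ijson can walk the elements of a large top-level JSON array without loading it all; a rough sketch (process() is a hypothetical handler, and the input can be any file-like object such as a response body):
import ijson

with open("big.json", "rb") as f:
    for obj in ijson.items(f, "item"):  # "item" addresses each element of the top-level array
        process(obj)  # hypothetical handler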
EDIT: Then it all comes down to the client, which I suppose will be written in JavaScript(?). But I don't see how you could construct JavaScript (or whatever language) objects out of incomplete JSON chunks. The only thing I can think of is manually breaking the long JSON down into smaller JSON objects on the server side and then streaming them to the client one by one. But that calls for websockets rather than stateless HTTP requests/responses. And if by web service you mean a REST API, then I guess that's not what you want.
Upvotes: -2