Reputation: 2102
I have created data objects that are comprised of (among other things) pandas objects like DataFrames and Panels. I'm looking to serialize these objects into JSON, and speed is a primary consideration.
pandas.Panel
Say for instance I have a panel like so:
In [54]: panel = pandas.Panel(numpy.random.randn(5, 100, 10),
                              items=['a', 'b', 'c', 'd', 'e'],
                              major_axis=pandas.DatetimeIndex(start='01/01/2000',
                                                              freq='b',
                                                              periods=100),
                              minor_axis=['z', 'y', 'x', 'v', 'u', 't', 's', 'r', 'q', 'o'])
In [64]: panel
Out[64]:
<class 'pandas.core.panel.Panel'>
Dimensions: 5 (items) x 100 (major_axis) x 10 (minor_axis)
Items axis: a to e
Major_axis axis: 2000-01-03 00:00:00 to 2000-05-19 00:00:00
Minor_axis axis: z to o
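(For readers on modern pandas: Panel was deprecated in 0.20 and removed in 1.0, so the example above no longer runs as written. A dict of DataFrames keyed by the old items axis is the simplest stand-in for the same 5 x 100 x 10 structure — a sketch:)

```python
import numpy as np
import pandas as pd

# pandas.Panel is gone in modern pandas; a dict of DataFrames keyed by the
# old items axis stands in for the 5 x 100 x 10 panel above.
idx = pd.date_range(start='2000-01-01', freq='B', periods=100)
frames = {item: pd.DataFrame(np.random.randn(100, 10),
                             index=idx,
                             columns=list('zyxvutsrqo'))
          for item in 'abcde'}
```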
And I'd like to turn this panel into flattened JSON.
NOTE: I'm doing this with more complicated objects, but the overall logic of looping over keys and generating JSON data for each key is the same.
I can write a quick and dirty panel_to_json() function like so:
def panel_to_json(panel):
    d = {'__type__': 'panel'}
    for item in panel.items:
        tmp = panel.loc[item, :, :].to_json()
        d[item] = eval(tmp)
    return json.dumps(d)
In [58]: tmp = panel_to_json(panel)
In [59]: tmp[:100]
Out[59]: '{"a": {"q": {"948931200000": -0.5586319118, "951955200000": 0.6820748888, "949363200000": -0.0153867'
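(Worth noting before going further: eval is not only slow here, it also executes arbitrary Python and would choke on JSON literals like true or null. json.loads is the standard way to parse the string back — a minimal sketch with a hand-written inner string standing in for a to_json result:)

```python
import json

# json.loads parses the JSON text produced by to_json back into Python
# objects, which json.dumps can then embed without double-escaping.
inner = '{"z": {"946857600000": 1.02}}'   # stand-in for panel.loc[item].to_json()
flat = json.dumps({"__type__": "panel", "a": json.loads(inner)})
print(flat)
```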
Which gets me the correct result; the problem is that the eval usage is very costly. For comparison, I can remove the eval and just deal with the smattering of \\ escapes that result, as in the panel_no_eval_to_json function here:
def panel_no_eval_to_json(panel):
    d = {'__type__': 'panel'}
    for item in panel.items:
        d[item] = panel.loc[item, :, :].to_json()
    return json.dumps(d)
In [60]: tmp = panel_no_eval_to_json(panel)
In [61]: tmp[:100]
Out[61]: '{"a": "{\\"z\\":{\\"946857600000\\":1.0233515965,\\"946944000000\\":-1.1333560575,\\"947030400000\\":-0.0072'
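(The backslashes appear because json.dumps sees each to_json result as a plain string and escapes its quotes on the second serialization pass; a two-line sketch shows the effect:)

```python
import json

# An already-serialized JSON string is treated as ordinary text, so every
# inner quote gets escaped with a backslash on the second pass.
inner = '{"z": 1}'
print(json.dumps({"a": inner}))   # {"a": "{\"z\": 1}"}
```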
The difference in speed is substantial. Check out their %timeit values:
In [62]: %timeit panel_no_eval_to_json(panel)
100 loops, best of 3: 3.55 ms per loop
In [63]: %timeit panel_to_json(panel)
10 loops, best of 3: 41.1 ms per loop
So my final goal is to loop through the Panel (or my object, which has various keys/attributes, many of them Panels and DataFrames), and merge the JSON streams created by invoking to_json() into one aggregated JSON stream (which would actually be the flattened representation of my data object), just as the panel_to_json function above (the one with eval) does.
My key goals are to reuse the pandas to_json functionality as much as possible, and to merge the resulting JSON streams quickly (I could write a json_stream_merger, but clearly this has already been done, right?)
Upvotes: 2
Views: 761
Reputation: 2102
In the end, the fastest way was to write a simple string concatenator. Here are the two best solutions (one provided by @Skorp) and their respective %timeit times in graphical form:
def panel_to_json_string(panel):
    def __merge_stream(key, stream):
        return '"' + key + '": ' + stream + ', '
    try:
        stream = '{ "__type__": "panel", '
        for item in panel.items:
            stream += __merge_stream(item, panel.loc[item, :, :].to_json())
        # take out the extra trailing comma
        stream = stream[:-2]
        # add the closing brace
        stream += '}'
    except Exception:
        logging.exception('Panel Encoding did not work')
    return stream
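(Since Panel no longer exists in current pandas, here is the same string-merge technique sketched over a dict of DataFrames — the function name and sample frames are illustrative, not from the original post:)

```python
import json
import pandas as pd

# Same string-merge idea as panel_to_json_string, applied to a dict of
# DataFrames standing in for the panel's items axis.
def frames_to_json_string(frames):
    parts = ['"__type__": "panel"']
    for key, df in frames.items():
        parts.append('"' + key + '": ' + df.to_json())
    return '{' + ', '.join(parts) + '}'

frames = {'a': pd.DataFrame({'x': [1, 2]}), 'b': pd.DataFrame({'x': [3, 4]})}
out = frames_to_json_string(frames)
# the merged result is itself valid JSON
print(json.loads(out)['__type__'])
```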
def panel_to_json_loads(panel):
    try:
        d = {'__type__': 'panel'}
        for item in panel.items:
            d[item] = json.loads(panel.loc[item, :, :].to_json())
        return json.dumps(d)
    except Exception:
        logging.exception('Panel Encoding did not work')
import timeit
import pandas
import numpy

setup = ("import strat_check.io as sio; import pandas; import numpy;"
         "panel = pandas.Panel(numpy.random.randn(5, {0}, 4), "
         "items = ['a', 'b', 'c', 'd', 'e'], "
         "major_axis = pandas.DatetimeIndex(start = '01/01/1990', "
         "freq = 's', "
         "periods = {0}), "
         "minor_axis = numpy.arange(4))")

vals = [10, 100, 1000, 10000, 100000]
d = {'string-merge': [],
     'loads-dumps': []}

for n in vals:
    number = 10
    d['string-merge'].append(
        timeit.timeit(stmt='panel_to_json_string(panel)',
                      setup=setup.format(n),
                      number=number))
    d['loads-dumps'].append(
        timeit.timeit(stmt='sio.panel_to_json_loads(panel)',
                      setup=setup.format(n),
                      number=number))
Upvotes: 1
Reputation: 2552
Have you considered merging the DataFrames and then calling to_json on the merged frame? You could use pd.merge(masterdf, panel[item], how="outer"). Just a thought; I haven't worked with panels, so I'm not sure whether the JSON representation would be accurate. You could also try using this in your loop (and consider the iteritems() method):
masterdf = pd.concat([masterdf, panel[item]], axis=1, keys=[list(masterdf.columns.values), item]) and then make that into JSON.
You may even be able to do something sexier like:
pd.concat([panel[item] for item in panel.items], axis=1, keys=list(panel.items)).to_json()
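(A sketch of that concat-then-to_json idea with plain DataFrames, since Panel is gone from modern pandas; a and b are hypothetical stand-ins for panel[item], and the MultiIndex column labels produced by keys= are flattened so each JSON key is a single string:)

```python
import json
import pandas as pd

# keys= builds MultiIndex columns like ('a', 'x'); joining each tuple into
# one string gives simple keys before a single to_json call.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})
merged = pd.concat([a, b], axis=1, keys=['a', 'b'])
merged.columns = ['_'.join(col) for col in merged.columns]
out = merged.to_json()
print(out)   # {"a_x":{"0":1,"1":2},"b_x":{"0":3,"1":4}}
```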
Upvotes: 0
Reputation: 2552
If all you need to do is get rid of the "\\" characters, you could use
.str.strip("\\") # or
.str.replace("\\", "")
You should read up on string methods, vectorized string methods, and regular expressions. Here's the pandas-specific info:
http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
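(A quick illustration: .str.replace operates element-wise on a Series of strings. Note this removes backslash characters literally; it won't un-escape nested JSON:)

```python
import pandas as pd

# Element-wise removal of literal backslashes from a Series of strings;
# regex=False makes the pattern a plain substring, not a regular expression.
s = pd.Series(['a\\b', 'c\\d'])
cleaned = s.str.replace('\\', '', regex=False)
print(cleaned.tolist())   # ['ab', 'cd']
```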
Upvotes: 0