GRS

Reputation: 3084

Optimising the cost of pandas DataFrame to JSON

My goal is to sort the DataFrame by one column and return a JSON object as efficiently as possible.

For reproduction, please define the following DataFrame:

import pandas as pd
import numpy as np

test = pd.DataFrame(data={
    'a': [np.random.randint(0, 100) for i in range(10000)],
    'b': [i + np.random.randint(0, 100) for i in range(10000)],
})

       a      b
0     74     89
1     55     52
2     53     39
3     26     21
4     69     34

What I need to do is sort by column a and then encode the result as a JSON object. I'm taking the basic approach:

test.sort_values('a', ascending=True, inplace=True) # n log n
data = [] # 1 (start empty; [{}] would leave a stray empty dict at the front)
for d in test.itertuples(): # n times
    to_append = {'id': d.Index, 'data': {'a': d.a, 'b': d.b}} # 3
    data.append(to_append) # 1

So is the cost n log n + 4n? Is there a more efficient way of doing this?
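To put actual numbers on the cost, the whole pipeline can be timed end to end. A minimal sketch using timeit; encode is just a hypothetical wrapper around the code above:

import json
import timeit

def encode(df):
    # Sort (O(n log n)), then build the list in one O(n) pass.
    df = df.sort_values('a', ascending=True)
    return json.dumps([{'id': d.Index, 'data': {'a': d.a, 'b': d.b}}
                       for d in df.itertuples()])

print(timeit.timeit(lambda: encode(test), number=100))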

Upvotes: 3

Views: 1351

Answers (1)

cs95

Reputation: 402553

I've noticed that pandas reads and writes JSON more slowly than pure Python does. If you're sure there are only two columns, you can do something like this:

import json

# Pair each index label with its row values; test.values.tolist()
# yields plain Python lists, which json.dumps handles directly.
data = [{'id': x, 'data': {'a': y, 'b': z}}
        for x, (y, z) in zip(test.index, test.values.tolist())]
json.dumps(data)
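To sanity-check that claim on your own data, you can time both paths. Note this is a rough sketch and not a strict apples-to-apples comparison, since to_json(orient='records') emits flat records without the nested 'id'/'data' structure:

import timeit

# Pandas' built-in encoder: flat records, index not included.
pandas_way = lambda: test.to_json(orient='records')
# The pure-Python path from above.
python_way = lambda: json.dumps(
    [{'id': x, 'data': {'a': y, 'b': z}}
     for x, (y, z) in zip(test.index, test.values.tolist())])

print('to_json:  ', timeit.timeit(pandas_way, number=100))
print('list comp:', timeit.timeit(python_way, number=100))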

If you have more columns to worry about, you could do something like:

# Generalise to any number of columns by zipping the column
# names with each row's values.
c = test.columns
data = [{'id': x, 'data': dict(zip(c, y))}
        for x, *y in zip(test.index, test.values.tolist())]
json.dumps(data)

Or, if resetting the index is acceptable, call reset_index before building the list:

# reset_index moves the index into the first column, so each row
# list starts with the id followed by the data values.
c = test.columns
data = [{'id': x[0], 'data': dict(zip(c, x[1:]))}
        for x in test.reset_index().values.tolist()]
json.dumps(data)
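If the json.dumps call itself turns out to dominate, a faster encoder can be dropped in. A minimal sketch, assuming ujson is installed; it accepts plain lists and dicts just like the standard library:

import ujson  # assumption: installed via `pip install ujson`

# Drop-in replacement for json.dumps on this list-of-dicts payload.
payload = ujson.dumps(data)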

Upvotes: 1
