Reputation: 3084
My goal is to sort the data frame by 1 column and return a json object as efficiently as possible.
For repoduction, please define the following dataframe:
import pandas as pd
import numpy as np
test = pd.DataFrame(data={'a':[np.random.randint(0,100) for i in range(10000)], 'b':[i + np.random.randint(0,100) for i in range(10000)]})
a b
0 74 89
1 55 52
2 53 39
3 26 21
4 69 34
What I need to do is sort by column a
and then encode the output in a json object. I'm taking the basic approach and doing:
test.sort_values('a', ascending=True, inplace=True) # n log n
data = [{}] # 1
for d in test.itertuples(): # n times
to_append = {'id': d.Index, 'data': {'a': d.a, 'b': d.b}} # 3
data.append(to_append) # 1
So is the cost nlogn + n*4? Are there any more efficient ways of doing it?
Upvotes: 3
Views: 1351
Reputation: 402553
I've noticed that pandas reads and writes JSON slower than pure python. If you're sure of the fact that there are only two columns, you can do something like this:
data = [{'id' : x, 'data' : {'a' : y, 'b' : z}}
for x, (y, z) in zip(test.index, test.values.tolist())]
json.dumps(data)
If you have more columns to worry about, you could do something like:
c = test.columns
data = [{'id' : x, 'data' : dict(zip(c, y))}
for x, *y in zip(test.index, test.values.tolist())]
json.dumps(data)
Or, if you can handle it, do a reset_index
call before saving:
c = test.columns
data = [{'id' : x[0], 'data' : dict(zip(c, x[1:]))}
for x in test.reset_index().values.tolist()]
json.dumps(data)
Upvotes: 1