Reputation: 93
I need some help to improve the performance of the following code.
    series_list = []
    for obj in dict_of_objects.values():
        test = pd.Series(obj.properties)  # properties is a dict
        series_list.append(test)

    # A list comprehension is not really faster than the loop, since pd.Series() takes most of the time
    # series_list = [pd.Series(obj.properties) for obj in dict_of_objects.values()]

    # Also very slow
    df = pd.DataFrame(series_list)
After timing the code, I found that the pd.Series() calls and pd.DataFrame(series_list) are very slow: both need around 9 s to complete, while append needs only 0.4 s. As a result, the list comprehension isn't really an improvement, since it calls pd.Series() as well.
Do you have some suggestions on how to improve the performance of this?
Best, Julz
Upvotes: 0
Views: 565
Reputation: 629
Let's look at some code snippets:
    import numpy as np
    import pandas as pd

    N_objects = 10
    N_samples = 10000

    class SimpleClass:
        def __init__(self, prop):
            self.properties = prop

    dict_of_objects = {'obj{}'.format(i):
                       SimpleClass({
                           'alice': np.random.rand(N_samples),
                           'bob': np.random.rand(N_samples)
                       }) for i in range(N_objects)}

    def slow_update(dict_of_objects):
        series_list = []
        for obj in dict_of_objects.values():
            test = pd.Series(obj.properties)
            series_list.append(test)
        return pd.DataFrame(series_list)

    def med_update(dict_of_objects):
        return pd.DataFrame([pd.Series(obj.properties) for obj in dict_of_objects.values()])

    def fast_update(dict_of_objects):
        # Take the keys from the first object only
        keys = iter(dict_of_objects.values()).__next__().properties.keys()
        return pd.DataFrame({k: [obj.properties[k] for obj in dict_of_objects.values()] for k in keys})
And with timings:
>>> %timeit slow_update(dict_of_objects)
2.88 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit med_update(dict_of_objects)
2.86 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit fast_update(dict_of_objects)
344 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
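The %timeit lines above are IPython magics. For anyone who wants to reproduce the comparison in a plain Python script, here is a rough sketch using the standard timeit module. The helper best_time and the smaller array size are my own choices for illustration, not part of the measurements above, so the absolute numbers will differ.

```python
import timeit

import numpy as np
import pandas as pd


class SimpleClass:
    def __init__(self, prop):
        self.properties = prop


# Smaller, made-up sizes than in the answer so the script finishes quickly.
dict_of_objects = {
    'obj{}'.format(i): SimpleClass({
        'alice': np.random.rand(100),
        'bob': np.random.rand(100),
    })
    for i in range(10)
}


def slow_update(dict_of_objects):
    series_list = []
    for obj in dict_of_objects.values():
        series_list.append(pd.Series(obj.properties))
    return pd.DataFrame(series_list)


def fast_update(dict_of_objects):
    # next(iter(...)) is equivalent to the explicit .__next__() call
    keys = next(iter(dict_of_objects.values())).properties.keys()
    return pd.DataFrame({
        k: [obj.properties[k] for obj in dict_of_objects.values()]
        for k in keys
    })


def best_time(func, repeat=3, number=20):
    """Hypothetical helper: best per-call time in seconds."""
    timer = timeit.Timer(lambda: func(dict_of_objects))
    return min(timer.repeat(repeat=repeat, number=number)) / number


timings = {f.__name__: best_time(f) for f in (slow_update, fast_update)}
for name, seconds in timings.items():
    print('{}: {:.1f} us per call'.format(name, seconds * 1e6))
```

Taking the minimum over several batches is the usual way to reduce noise from other processes; %timeit reports mean and standard deviation instead, so the two are not directly comparable.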
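The fast update takes the property keys from the first object (by calling __next__ on an iterator over the dictionary's values) and then builds one column per key by collecting that key's value from every object. It's about 8 times faster than the other two versions.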
Edit: as correctly pointed out by @koPytok, fast_update will not work if the objects' properties attributes have different keys. This is worth bearing in mind if you choose to implement this for something such as a NoSQL database grab: in MongoDB, documents are not required to share the same fields (here swap document for object, field for key).
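To make the caveat concrete: because fast_update reads the keys from the first object only, it raises a KeyError when a later object lacks one of those keys, and it silently ignores keys that appear only on later objects. One possible workaround is to collect the union of keys first and fill the gaps with NaN. This is only a sketch; union_update is a hypothetical name that is not part of the answer above, and scalar property values are used purely to keep the example short.

```python
import numpy as np
import pandas as pd


class SimpleClass:
    def __init__(self, prop):
        self.properties = prop


def union_update(dict_of_objects):
    """Hypothetical variant of fast_update that tolerates differing keys."""
    # Collect every key seen on any object, keeping first-seen order.
    keys = {}
    for obj in dict_of_objects.values():
        keys.update(dict.fromkeys(obj.properties))
    # One column per key; objects lacking a key contribute NaN.
    return pd.DataFrame({
        k: [obj.properties.get(k, np.nan) for obj in dict_of_objects.values()]
        for k in keys
    })


objs = {
    'obj0': SimpleClass({'alice': 1.0, 'bob': 2.0}),
    'obj1': SimpleClass({'alice': 3.0, 'carol': 4.0}),
}
df = union_update(objs)
print(df)
```

The extra pass over the keys costs some time, so this is probably a little slower than fast_update; I have not timed it.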
Enjoy!
Upvotes: 2
Reputation: 3723
The same data can be assembled without the intermediate pd.Series objects, for example:
    properties_list = [o.properties for o in dict_of_objects.values()]
    df = pd.DataFrame(properties_list).T
Or with a dict of properties, which requires fewer operations:
    properties_dict = {k: o.properties for k, o in dict_of_objects.items()}
    df = pd.DataFrame.from_dict(properties_dict)
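One thing to watch with both snippets is orientation. pd.DataFrame(series_list) in the question puts each object on its own row, whereas the frames here end up with one column per object. If the row-per-object layout is what you need, from_dict accepts orient='index'. A small sketch with made-up scalar properties (the SimpleClass stand-in and the values are assumptions for illustration only):

```python
import pandas as pd


class SimpleClass:
    def __init__(self, prop):
        self.properties = prop


dict_of_objects = {
    'obj0': SimpleClass({'alice': 1.0, 'bob': 2.0}),
    'obj1': SimpleClass({'alice': 3.0, 'bob': 4.0}),
}

properties_dict = {k: o.properties for k, o in dict_of_objects.items()}

# Default orient='columns': one column per object, one row per property key.
by_column = pd.DataFrame.from_dict(properties_dict)

# orient='index': one row per object, one column per property key,
# i.e. the same orientation as pd.DataFrame(series_list) in the question.
by_row = pd.DataFrame.from_dict(properties_dict, orient='index')

print(by_column)
print(by_row)
```

Unlike the question's frame, which gets a default integer index, by_row is indexed by the dictionary keys ('obj0', 'obj1'); call reset_index(drop=True) if that matters.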
Upvotes: 2