parmatma

Reputation: 35

Python pandas spending excessive time in garbage collection

I am working on a complex piece of Python code which spends around 40% of its execution time in garbage collection.

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   6028  494.097    0.082  494.097    0.082  {built-in method gc.collect}
   1900  205.709    0.108  205.709    0.108  {built-in method time.sleep}
    778   26.858    0.035  383.476    0.493  func1.py:51(fill_trades)

Is there a way to reduce the number of calls to gc.collect? I tried gc.disable(), but its effectiveness is limited since CPython relies largely on reference counting. I am using Python 3.6.

Upvotes: 1

Views: 1584

Answers (2)

patricksurry

Reputation: 5868

I ran into a similar problem, where my code was spending 90% of its time in garbage collection. My function took about 90ms per call in testing, but closer to 1s per call in production. I tracked it down to pandas checking for a quiet form of its SettingWithCopyWarning.

In my case I created a slice of a dataframe like df = pd.DataFrame(data)[fieldlist] and then assigned a new column with df['foo'] = .... At that point df._is_copy shows that we hold a weakref to the original dataframe, so when we call __setitem__ it runs _check_setitem_copy, which performs a full garbage collection cycle, gc.collect(2), to kill off the weakref.
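For reference, here's a minimal sketch of that pattern (the internals are version-dependent, and data / fieldlist are just placeholders):

    import pandas as pd

    data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
    fieldlist = ["a", "b"]

    df = pd.DataFrame(data)[fieldlist]  # the slice keeps a weakref to its parent in df._is_copy
    df["foo"] = df["a"] * 2             # __setitem__ -> _check_setitem_copy -> gc.collect(2)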

In production my code was calling that function a couple of times per second, with a bunch of large objects (dicts) in a cache, so each garbage collection cycle was very expensive. I fixed it by making sure I didn't create a copy in the first place, and performance improved almost 15x :-|
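One way to avoid the copy, sketched with the same placeholder names as above: either build the frame with only the columns you need, or take an explicit .copy() so pandas drops the weakref.

    df = pd.DataFrame(data, columns=fieldlist)   # no parent frame, so no weakref to check
    # or, for an existing frame:
    df = pd.DataFrame(data)[fieldlist].copy()    # an explicit copy clears df._is_copy
    df["foo"] = df["a"] * 2                      # assignment no longer triggers gc.collect(2)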

Upvotes: 3

viraptor

Reputation: 34145

This is not really possible to answer properly without seeing the code. There are some generic tips you can use to improve the situation though.

The main one is: limit the number of allocations. Are you constantly repacking values into tiny wrapper objects that aren't useful? Are you copying parts of your strings a lot? Are you doing a lot of message parsing that copies data? Find what allocates memory most often and improve it. https://pypi.python.org/pypi/memory_profiler may be helpful here.
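For example, a minimal memory_profiler run (assumes pip install memory-profiler; build_rows is just an illustrative stand-in for your own hot function):

    from memory_profiler import profile

    @profile                 # prints per-line memory usage when the function runs
    def build_rows(n):
        rows = []
        for i in range(n):
            rows.append({"id": i, "label": str(i)})  # every dict here is a fresh allocation
        return rows

    if __name__ == "__main__":
        build_rows(100000)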

Situation-specific fixes:

  • Are you doing a lot of math-intensive operations? Moving to something like numpy might help, since you get real, mutable, typed arrays rather than lists.
  • Do you have a lot of data-processing code? You may be able to annotate it with types and compile the module using cython to remove the need to wrap values in Python objects.
  • For raw memory (parsing / file processing / ...) you can save some allocations by using memoryviews (see the sketch after this list): https://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews
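To illustrate the last point, a tiny memoryview sketch: slicing the view doesn't copy the underlying bytes, unlike slicing the bytes object itself.

    data = bytearray(b"header:payload")
    view = memoryview(data)     # wraps the existing buffer, no copy
    payload = view[7:]          # zero-copy slice; bytes(data)[7:] would allocate a new object
    print(bytes(payload))       # b'payload' - the copy happens only here, explicitly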

And finally - are you sure the collect time is that problematic? From the trace, the second place on the list goes to time.sleep. If gc.collect takes 40% of the runtime, then time.sleep takes about 16% - why not trigger collection at that point instead? You're explicitly sleeping anyway.
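A minimal sketch of that idea (work_remaining and do_work are hypothetical stand-ins for your own loop):

    import gc
    import time

    gc.disable()                # stop automatic generational collection; refcounting still runs

    while work_remaining():     # hypothetical loop condition
        do_work()               # hypothetical workload
        gc.collect()            # collect while we're about to idle anyway
        time.sleep(0.5)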

Edit: Also, I do believe you're calling gc.collect explicitly somewhere - implicit collections don't show up in pstats output, only explicit calls do. To find out where, try:

your_pstats_object.print_callers('gc.collect')
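For example, assuming you profiled with cProfile into a stats file (main and out.prof are placeholders):

    import cProfile
    import pstats

    cProfile.run("main()", "out.prof")   # profile your entry point into a stats file
    stats = pstats.Stats("out.prof")
    stats.print_callers("gc.collect")    # lists every function that calls gc.collect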

Upvotes: 2
