parmatma

Reputation: 35

Python pandas spending excessive time in garbage collection

I am working on a complex piece of Python code which spends around 40% of its execution time in garbage collection.

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   6028  494.097    0.082  494.097    0.082  {built-in method gc.collect}
   1900  205.709    0.108  205.709    0.108  {built-in method time.sleep}
    778   26.858    0.035  383.476    0.493  func1.py:51(fill_trades)

Is there a way to reduce the number of calls to gc.collect? I tried gc.disable(), but its effectiveness is limited since CPython relies largely on reference counting. I am using Python 3.6.

Upvotes: 1

Views: 1584

Answers (2)

patricksurry

Reputation: 5868

I ran into a similar problem, where my code was spending 90% of its time in garbage collection. My function took about 90ms per call in testing, but closer to 1s per call in production. I tracked it down to pandas checking for a quiet form of its SettingWithCopyWarning.

In my case I created a slice of a dataframe like df = pd.DataFrame(data)[fieldlist] and then assigned a new column with df['foo'] = .... At that point df._is_copy shows that we hold a weakref to the original dataframe, so when we call __setitem__ it runs _check_setitem_copy, which performs a full garbage collection cycle, gc.collect(2), to kill off the weakref.
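For reference, here's a minimal sketch of that pattern (the internals are version-dependent, and data / fieldlist are just placeholders):

    import pandas as pd

    data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
    fieldlist = ["a", "b"]

    df = pd.DataFrame(data)[fieldlist]  # the slice keeps a weakref to its parent in df._is_copy
    df["foo"] = df["a"] * 2             # __setitem__ -> _check_setitem_copy -> gc.collect(2)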

In production my code was calling that function a couple of times per second, with a bunch of large objects (dicts) in a cache, so each garbage collection cycle was very expensive. I fixed it by making sure I didn't create a copy in the first place, and performance improved almost 15x :-|
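One way to avoid the copy, sketched with the same placeholder names as above: either build the frame with only the columns you need, or take an explicit .copy() so pandas drops the weakref.

    df = pd.DataFrame(data, columns=fieldlist)   # no parent frame, so no weakref to check
    # or, for an existing frame:
    df = pd.DataFrame(data)[fieldlist].copy()    # an explicit copy clears df._is_copy
    df["foo"] = df["a"] * 2                      # assignment no longer triggers gc.collect(2)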

Upvotes: 3

viraptor

Reputation: 34145

This is not really possible to answer properly without seeing the code. There are some generic tips you can use to improve the situation though.

The main one is: limit the number of allocations. Are you constantly repacking values into tiny wrapper objects that aren't useful? Are you copying parts of your strings a lot? Are you doing a lot of message parsing that copies data? Find what allocates memory most often and improve it. https://pypi.python.org/pypi/memory_profiler may be helpful here.
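For example, a minimal memory_profiler run (assumes pip install memory-profiler; build_rows is just an illustrative stand-in for your own hot function):

    from memory_profiler import profile

    @profile                 # prints per-line memory usage when the function runs
    def build_rows(n):
        rows = []
        for i in range(n):
            rows.append({"id": i, "label": str(i)})  # every dict here is a fresh allocation
        return rows

    if __name__ == "__main__":
        build_rows(100000)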

Situation-specific fixes:

  • Are you doing a lot of math-intensive operations? Moving to something like numpy might help, since you get real, mutable, typed arrays rather than lists.
  • Do you have a lot of data-processing code? You may be able to annotate it with types and compile the module using cython to remove the need to wrap values in Python objects.
  • For raw memory (parsing / file processing / ...) you can save some allocations by using memoryviews (see the sketch after this list): https://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews
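To illustrate the last point, a tiny memoryview sketch: slicing the view doesn't copy the underlying bytes, unlike slicing the bytes object itself.

    data = bytearray(b"header:payload")
    view = memoryview(data)     # wraps the existing buffer, no copy
    payload = view[7:]          # zero-copy slice; bytes(data)[7:] would allocate a new object
    print(bytes(payload))       # b'payload' - the copy happens only here, explicitly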

And finally - are you sure the collect time is that problematic? From the trace, the second place on the list goes to time.sleep. If gc.collect takes 40% of the runtime, then time.sleep takes about 16% - why not trigger collection at that point instead? You're explicitly sleeping anyway.
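A minimal sketch of that idea (work_remaining and do_work are hypothetical stand-ins for your own loop):

    import gc
    import time

    gc.disable()                # stop automatic generational collection; refcounting still runs

    while work_remaining():     # hypothetical loop condition
        do_work()               # hypothetical workload
        gc.collect()            # collect while we're about to idle anyway
        time.sleep(0.5)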

Edit: Also, I do believe you're calling gc.collect explicitly somewhere - implicit collections don't show up in pstats output, only explicit calls do. To find out where, try:

your_pstats_object.print_callers('gc.collect')
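For example, assuming you profiled with cProfile into a stats file (main and out.prof are placeholders):

    import cProfile
    import pstats

    cProfile.run("main()", "out.prof")   # profile your entry point into a stats file
    stats = pstats.Stats("out.prof")
    stats.print_callers("gc.collect")    # lists every function that calls gc.collect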

Upvotes: 2
