Reputation: 1843
I'm having a problem with a long-running script. It is a multithreaded program that performs crawling tasks.
On long runs, the script's memory consumption becomes huge, and after profiling memory with guppy's hpy, I saw that most of it comes from strings.
I'm not storing that many strings: I just read the content of HTML pages into memory in order to store them in the DB. After that, the string is not used anymore (the variable holding it is assigned the next string).
The problem came up because I saw (with sys.getrefcount) that every new string has at least 2 references (1 from my variable and 1 internal). It seems that assigning another value to my variable does not remove the internal reference, so the string stays in memory.
What can I do to be sure that strings are garbage collected?
Thank you in advance
EDIT:
1- I'm using Django ORM
2- I'm obtaining all of those strings from 2 sources:
2.1- Directly from the socket (urllib2.urlopen(url).read())
2.2- Parsing those responses, extracting new URIs from every HTML page, and feeding them back into the system
SOLVED
Finally, I found the key. The script is part of a Django environment, and it seems that Django is doing some caching or something similar under the hood. I turned off debugging, and everything started to work as expected (reusing an identifier now drops the reference to the old object, and those objects get collected by the gc).
For anyone who uses some kind of framework layer over Python, be aware of its configuration: it seems that some debug settings, combined with intensive processing, can lead to memory leaks.
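One documented Django behavior fits this symptom, though it is an assumption that it was the cause here: with DEBUG = True, Django keeps a copy of every SQL statement it executes in django.db.connection.queries, so statements that embed whole HTML pages stay referenced by that log. A minimal settings sketch:

```python
# settings.py (fragment)
# With DEBUG off, Django does not record executed SQL statements,
# so the strings embedded in them are no longer kept alive by the query log.
DEBUG = False
```

If DEBUG has to stay on, calling django.db.reset_queries() from time to time in the crawl loop clears the recorded statements.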
Upvotes: 2
Views: 2154
Reputation: 27585
You say:
I saw that every new string (with sys.getrefcount) have, at least, 2 references
But did you carefully read the description of getrefcount()?
sys.getrefcount(object)
Return the reference count of the object. The count returned is generally one higher than you might expect, because it includes the (temporary) reference as an argument to getrefcount().
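A small sketch of what that means in practice (Python 3 syntax, CPython assumed; the function and variable names are made up for illustration). It compares the count before and after adding one real reference, so the temporary argument reference cancels out:

```python
import sys


def refcount_delta():
    # Build the string at runtime so it is a fresh object, not a shared constant.
    html = "".join(["<html>", "x" * 100, "</html>"])
    before = sys.getrefcount(html)  # includes the temporary argument reference
    alias = html                    # one genuinely new reference
    after = sys.getrefcount(html)   # same temporary reference, plus the alias
    return before, after


before, after = refcount_delta()
print(before, after)
```

The second number is one higher than the first. The absolute values depend on the interpreter version, so a bare count of 2 for a string held by a single variable is not evidence of a hidden reference.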
You should explain more about your program.
What is the size of the HTML strings it holds?
How are they obtained? Are you sure you close all file handles, all socket connections, and so on?
Upvotes: 2
Reputation: 64338
You'd need to find out who keeps the "internal" reference to your strings. Perhaps the library you're using to write to DB (you didn't specify how you write to DB). I find objgraph very useful for tasks like this: https://pypi.python.org/pypi/objgraph
E.g.
import objgraph
objgraph.show_backrefs([mystring], filename='a.png')
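If installing objgraph is not an option, the standard library's gc.get_referrers gives a rougher view of the same information. The sketch below uses Python 3 syntax, and hidden_cache is a hypothetical stand-in for whatever internal structure is holding on to your strings:

```python
import gc


def holders_of(obj):
    """Return the GC-tracked containers that directly reference obj."""
    return gc.get_referrers(obj)


hidden_cache = []  # hypothetical stand-in for a library's internal cache

# Build the page at runtime so it is not a compile-time constant.
page = "".join(["<html>", "x" * 1000, "</html>"])
hidden_cache.append(page)

holders = holders_of(page)
found = any(h is hidden_cache for h in holders)

# Print only the types: the full containers can be very large.
for h in holders:
    print(type(h))
```

The result also contains ordinary holders such as the module's globals dict, so the interesting entries are the containers you did not create yourself.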
Upvotes: 0