user989501

Reputation: 1843

Python: how to garbage collect strings

I'm having a problem in a long-running script. The script runs in a multithreaded environment and performs crawling tasks.

On long runs, the script's memory consumption becomes huge, and after profiling memory with guppy's hpy, I saw that most of it comes from strings.

I'm not storing that many strings: I just read the content of HTML pages into memory in order to store them in the db. After that, the string is not used anymore (the variable containing it is assigned the next string).

The problem came to my attention because I saw that every new string (with sys.getrefcount) has at least 2 references (1 from my var, and 1 internal). It seems that assigning another value to my var does not remove the internal reference, so the string stays in memory.

What can I do to be sure that strings are garbage collected?

Thank you in advance

EDIT:

1- I'm using the Django ORM

2- I'm obtaining all of those strings from 2 sources:

2.1- Directly from the socket (urllib2.urlopen(url).read())

2.2- Parsing those responses, extracting new URIs from every HTML page, and feeding them back into the system

SOLVED

Finally, I found the key. The script is part of a Django environment, and it seems that Django is doing some caching or something similar under the hood. I turned off debugging, and everything started to work as expected (reusing an identifier now releases the reference to the old object, and that object is collected by the gc).

For anyone who uses some kind of framework layer over Python, be aware of the configuration: it seems that some debug settings, combined with intensive processing, can lead to memory leaks.
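A likely explanation, offered as an assumption rather than something confirmed in this thread: when DEBUG is True, Django keeps a record of every SQL statement it executes in django.db.connection.queries. In a long-running script that inserts whole HTML pages, that record would keep the page text alive. A minimal settings sketch:

```python
# settings.py (fragment)
# With DEBUG = True, Django records every executed SQL statement in
# django.db.connection.queries; with DEBUG = False it does not.
DEBUG = False
```

If DEBUG has to stay on, calling django.db.reset_queries() from time to time inside the crawl loop should clear that record. Where to call it depends on how the script is structured.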

Upvotes: 2

Views: 2154

Answers (2)

eyquem

Reputation: 27585

You say:
"I saw that every new string (with sys.getrefcount) has at least 2 references"

But did you read the description of getrefcount() carefully?

sys.getrefcount(object)

Return the reference count of the object. The count returned is generally one higher than you might expect, because it includes the (temporary) reference as an argument to getrefcount().
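To see what that description means in practice, here is a small sketch (the variable names are only illustrative). It compares counts before and after adding a second name, because the absolute numbers can vary between CPython versions:

```python
import sys

# Build the string at runtime so it is a fresh object rather than a shared literal.
html = "".join(["<html>", "payload", "</html>"])

# On CPython this typically counts the name `html` plus the temporary
# reference held by getrefcount's own argument.
one_name = sys.getrefcount(html)

alias = html                        # a second name bound to the same object
two_names = sys.getrefcount(html)   # one higher than one_name

del alias                           # drop the extra name again
after_del = sys.getrefcount(html)   # back to one_name

print(one_name, two_names, after_del)
```

So a count that is one above the number of names you hold does not by itself point to a hidden "internal" reference.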

You should explain more about your program.

What is the size of the HTML strings it holds?
How are they obtained? Are you sure you close all file handles, all socket connections, and so on?
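As an illustration of the last point, here is a minimal sketch of reading a response while making sure the connection is closed. It assumes Python 3's urllib.request in place of the urllib2 call from the question, and fetch is a hypothetical helper name:

```python
from contextlib import closing
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def fetch(url):
    """Read the whole response body and close the connection before returning."""
    # closing() calls response.close() when the block exits, even on an exception.
    with closing(urlopen(url)) as response:
        return response.read()


# A data: URL keeps the demonstration offline; a crawler would pass an http(s) URL.
body = fetch("data:text/plain,hello")
print(len(body))
```

Without the explicit close, the response object (and its socket) lives until something else releases it, which can matter in a crawler that opens many connections.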

Upvotes: 2

shx2

Reputation: 64338

You'd need to find out who keeps the "internal" reference to your strings. Perhaps it is the library you're using to write to the DB (you didn't specify how you write to it). I find objgraph very useful for tasks like this: https://pypi.python.org/pypi/objgraph

E.g.

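import objgraph
# mystring is the string you suspect is being kept alive.
# Writes a graph of the objects that reference it to a.png (rendering needs Graphviz).
objgraph.show_backrefs([mystring], filename='a.png')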
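If installing objgraph is not an option, the standard library's gc.get_referrers gives a rougher view of the same information. The sketch below is only an illustration: describe_referrers and cache are hypothetical names, and cache stands in for whatever is holding the hidden reference.

```python
import gc


def describe_referrers(obj):
    """Return the type names of the tracked containers that reference obj."""
    return [type(ref).__name__ for ref in gc.get_referrers(obj)]


page = "".join(["<html>", "body", "</html>"])  # string built at runtime
cache = [page]  # hypothetical stand-in for a hidden "internal" reference

# The result includes 'list' for `cache`, plus whatever namespace holds `page`.
print(describe_referrers(page))
```

To inspect the holders themselves rather than just their types, iterate over gc.get_referrers(page) directly and print each one.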

Upvotes: 0
