soshial

Reputation: 6808

Python memory leak in big data structures (lists, dicts) -- what could be the reason?

The code is extremely simple, and it shouldn't leak anything: everything happens inside a function and nothing is returned. I have a function that goes over all lines in a file (~20 MiB) and puts them all into a list.
The function in question:

def read_art_file(filename, path_to_dir):
    import codecs
    corpus = []
    corpus_file = codecs.open(path_to_dir + filename, 'r', 'iso-8859-15')
    newline = corpus_file.readline().strip()
    while newline != '':
        # we put the file's @newline plus some other info into @article
        # (I left those lists blank for readability)
        article = [newline, [], [], [], [], [], [], [], [], [], [], [], []]
        corpus.append(article)
        del newline
        del article
        newline = corpus_file.readline().strip()
    memory_usage('inside function')
    for article in corpus:
        for word in article:
            del word
        del article
    del corpus
    corpus_file.close()
    memory_usage('inside: after corp deleted')
    return

Here is the main code:

memory_usage('START')
path_to_dir = '/home/soshial/internship/training_data/parser_output/'
read_art_file('accounting.n.txt.wpr.art', path_to_dir)
memory_usage('outside func')
time.sleep(5)
memory_usage('END')

Each memory_usage call just prints the amount of resident memory (in KiB) currently used by the script.

Executing the script

If I run the script, it gives me:

START memory: 6088 KiB
inside memory: 393752 KiB (20 MiB file + lists occupy 400 MiB)
inside: after corp deleted memory: 43360 KiB
outside func memory: 34300 KiB (34300 - 6088 KiB, i.e. ~28 MiB leaked)
FINISH memory: 34300 KiB

Executing without appending to the list

And if I do exactly the same thing, but with the line that appends article to corpus commented out:

article = [newline, [], [], [], [], [], ...]  # we still assign data to `article`
# corpus.append(article)  # this line is commented out for the second run

This time the output is:

START memory: 6076 KiB
inside memory: 6076 KiB
inside: after corp deleted memory: 6076 KiB
outside func memory: 6076 KiB
FINISH memory: 6076 KiB

QUESTION:

So in this case all of the memory is freed. I need all of the memory to be freed, because I am going to process hundreds of such files.
Am I doing something wrong, or is this a bug in the CPython interpreter?

UPD. This is how I check memory consumption (taken from another Stack Overflow question):

def memory_usage(text = ''):
    """Print the resident memory (VmRSS) of the current process in KiB."""
    status = None
    result = {'peak': 0, 'rss': 0}
    try:
        # This will only work on systems with a /proc file system
        # (like Linux).
        status = open('/proc/self/status')
        for line in status:
            parts = line.split()
            key = parts[0][2:-1].lower()  # e.g. 'VmRSS:' -> 'rss', 'VmPeak:' -> 'peak'
            if key in result:
                result[key] = int(parts[1])
    finally:
        if status is not None:
            status.close()
    print('>', text, 'memory:', result['rss'], 'KiB')
    return

Upvotes: 9

Views: 9875

Answers (2)

chepner

Reputation: 532093

This loop

for article in corpus:
    for word in article:
        del word
    del article

does not free memory. del word simply decrements the reference count of the object referenced by the name word. However, your loop increments the reference count of each object by one when the loop variable is set. In other words, there is no net change in the reference count of any object due to this loop.
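To make this concrete, here is a small sketch (the names are illustrative, not taken from the question) using sys.getrefcount, which always reports one extra reference for its own argument:

import sys

corpus = [['some line', [], []]]    # corpus holds one reference to the inner list
article = corpus[0]                 # the name `article` holds a second reference

print(sys.getrefcount(article))     # 3: corpus, `article`, and getrefcount's argument
del article                         # only unbinds the name; the object stays alive in corpus
print(sys.getrefcount(corpus[0]))   # 2: corpus and getrefcount's argument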

When you comment out the call to corpus.append, you are not keeping any references to objects read from the file from one iteration to the next, so the interpreter is free to deallocate the memory earlier, which accounts for the decrease in memory you observe.
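The same difference can be shown with a throwaway list subclass (Tracked is hypothetical, used only to make deallocation visible):

class Tracked(list):
    def __del__(self):
        print('freed:', self[0])

corpus = []
article = Tracked(['kept in corpus'])
corpus.append(article)
del article                       # prints nothing: corpus still references the object

article = Tracked(['kept nowhere'])
article = None                    # prints 'freed: kept nowhere' right away

del corpus                        # now 'freed: kept in corpus' is printed as well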

Upvotes: 1

mgilson

Reputation: 310117

Please note that Python never guarantees that any memory your code uses will actually be returned to the OS. All that garbage collection guarantees is that the memory used by an object that has been collected is free to be reused by another object at some future time.

From what I've read [1] about the CPython memory allocator, memory gets allocated in "pools" for efficiency. When a pool is full, Python allocates a new pool. If a pool contains only dead objects, CPython actually frees the memory associated with that pool, but otherwise it doesn't. This can leave multiple partially full pools hanging around after a function returns. However, this doesn't really mean it is a "memory leak" (CPython still knows about the memory and could potentially free it at some later time).

[1] I'm not a Python dev, so these details may be incorrect or at least incomplete.
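As a rough, Linux-only illustration of this (a hypothetical rss_kib helper that reuses the /proc/self/status trick from the question), building and then deleting a large nested list will typically leave the process RSS well above its starting value even though the objects themselves are gone; how much is handed back to the OS depends on the Python version and allocator:

def rss_kib():
    # Resident set size (VmRSS) of the current process, in KiB, read from /proc.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

print('before:', rss_kib(), 'KiB')
data = [[str(i), [], [], []] for i in range(500000)]
print('peak:  ', rss_kib(), 'KiB')
del data
print('after: ', rss_kib(), 'KiB')  # usually above 'before': freed pools/arenas
                                    # are not necessarily returned to the OS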

Upvotes: 8
