
Reputation: 1443

Why does text representation as a list consume so much memory?

I have a 335 MB text file. The entire text is tokenized; the tokens are separated by whitespace. I want to represent each sentence as a list of words, and the entire text as a list of sentences, so I end up with a list of lists.

I use this simple piece of code to load the text into main memory:

def get_tokenized_text(file_name):
    with open(file_name, 'rt') as f:
        sentences = f.readlines()

    return [sent.strip().split(' ') for sent in sentences]

Unfortunately, this method consumes so much memory that my laptop always crashes. I have 4 GB of RAM, and it is exhausted after about five seconds.

Why? The text itself should occupy about 335 MB. Even if I generously allow, say, four times as much memory for bookkeeping overhead, there is no reason to run out of memory. Is there a memory leak that I'm overlooking?

Upvotes: 2

Views: 126

Answers (4)

John D

Reputation: 1637

My first answer attempted to reduce memory usage by not keeping intermediate lists in memory at the same time, but that still did not manage to squeeze the whole data structure into 4 GB of RAM.

With this approach, using a 40 MB text file made up of Project Gutenberg books as test data, the memory requirement is reduced from 270 MB to 55 MB. A 335 MB input file would then take an estimated 500 MB of memory, which hopefully would fit.

This approach builds a dictionary of unique words and assigns a unique integer token to each one (word_dict). The list of sentences word_tokens then stores the integer tokens instead of the words themselves. Finally word_dict has its keys and values swapped, so the integer tokens in word_tokens can be used to look up the corresponding word.

I'm using 32-bit Python, which uses a lot less memory than 64-bit Python because the pointers are half the size.

To get the total size of containers like lists and dictionaries, I've used code from http://code.activestate.com/recipes/577504/ by Raymond Hettinger. It counts not just the container itself but also the sub-containers and the bottom-level items they point to.

import sys

# Original approach
def get_tokenized_text(file_name):
    words = []
    f = open(file_name,'rt')
    for line in f:
        words.append( line.strip().split(' ') )
    return words

# Two step approach
# 1. Build a dictionary of unique words in the file indexed with an integer

def build_dict(file_name):
    word_dict = {}
    n = 0
    f = open(file_name,'rt')
    for line in f:
        words = line.strip().split(' ')
        for w in words:
            if w not in word_dict:
                word_dict[w] = n
                n = n + 1
    return word_dict

# 2. Read the file again and build list of sentence-words but using the integer indexes instead of the word itself

def read_with_dict(file_name, word_dict):
    tokens = []
    f = open(file_name,'rt')
    for line in f:
        words = line.strip().split(' ')
        tokens.append( [word_dict[w] for w in words] )
    return tokens


# Adapted from http://code.activestate.com/recipes/577504/ by Raymond Hettinger 
from itertools import chain
from collections import deque

def total_size(o, handlers={}):
    """ Returns the approximate memory footprint an object and all of its contents.

    Automatically finds the contents of the following builtin containers and
    their subclasses:  tuple, list, deque, dict, set and frozenset.
    To search other containers, add handlers to iterate over their contents:

        handlers = {SomeContainerClass: iter,
                    OtherContainerClass: OtherContainerClass.get_elements}

    """
    dict_handler = lambda d: chain.from_iterable(d.items())
    all_handlers = {tuple: iter,
                    list: iter,
                    deque: iter,
                    dict: dict_handler,
                    set: iter,
                    frozenset: iter,
                   }
    all_handlers.update(handlers)     # user handlers take precedence
    seen = set()                      # track which object id's have already been seen
    default_size = sys.getsizeof(0)       # estimate sizeof object without __sizeof__

    def sizeof(o):
        if id(o) in seen:       # do not double count the same object
            return 0
        seen.add(id(o))
        s = sys.getsizeof(o, default_size)

        for typ, handler in all_handlers.items():
            if isinstance(o, typ):
                s += sum(map(sizeof, handler(o)))
                break
        return s
    return sizeof(o)

# Display your Python configuration. 32-bit Python takes about half the memory of 64-bit
import platform
print platform.architecture(), sys.maxsize          # ('32bit', 'WindowsPE') 2147483647

file_name = 'LargeTextTest40.txt'                   # 41,573,429 bytes

# I ran this only for a size comparison - don't run it on your machine
# words = get_tokenized_text(file_name)
# print len(words), total_size(words)               # 962,632  268,314,991

word_dict = build_dict(file_name)
print len(word_dict), total_size(word_dict)         # 185,980  13,885,970

word_tokens = read_with_dict(file_name, word_dict)
print len(word_tokens), total_size(word_tokens)     # 962,632  42,370,804

# Reverse the dictionary by swapping key and value so the integer token can be used to look up the corresponding word
word_dict = dict((v, k) for k, v in word_dict.items())
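
For example, once the dictionary has been reversed, a sentence can be rebuilt from its integer tokens; this is just a quick sanity check, not part of the measurements above:

print ' '.join(word_dict[t] for t in word_tokens[0])    # prints the first sentence of the file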

Upvotes: 0

John D

Reputation: 1637

You are keeping multiple representations of the data in memory at the same time: the file buffer used by readlines(), the sentences list, and then again the list you build to return. To reduce memory, process the file one line at a time. Only words then holds the entire contents of the file.

def get_tokenized_text(file_name):
    words = []
    f = open(file_name,'rt')
    for line in f:
        # add each word on this line, skipping any word already collected
        words.extend( x for x in line.strip().split(' ') if x not in words)
    return words

words = get_tokenized_text('book.txt')
print words
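
If you don't need the whole result in memory at once, a generator along the same line-at-a-time idea keeps only the current sentence alive (the iter_tokenized_sentences helper below is just a sketch, not code from above):

def iter_tokenized_sentences(file_name):
    # yield one tokenized sentence per line instead of accumulating a big list
    with open(file_name, 'rt') as f:
        for line in f:
            yield line.strip().split(' ')

for sentence in iter_tokenized_sentences('book.txt'):
    print sentence          # handle each sentence, then let it be garbage collected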

Upvotes: 0

John Bollinger

Reputation: 181149

Why? The text itself should occupy about 335 MB.

Supposing that the text is encoded in UTF-8 or one of the various single-byte encodings -- which is likely -- the text itself does occupy a bit more than 335 MB in Python 2, but it can take as much as two or four times that in Python 3, depending on your implementation and on the characters involved. This is because Python 3 strings are Unicode strings, and they are stored internally with one, two, or four bytes per character (CPython 3.3+ uses one byte per character only for purely Latin-1 text).
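
You can see this directly with sys.getsizeof on a 64-bit CPython 3.4-era build (the exact figures vary a little between versions):

>>> import sys
>>> sys.getsizeof('a' * 1000)            # Latin-1 text: 1 byte per character plus a fixed header
1049
>>> sys.getsizeof('\u20ac' * 1000)       # BMP text such as the euro sign: 2 bytes per character
2074
>>> sys.getsizeof('\U0001F600' * 1000)   # astral characters such as emoji: 4 bytes per character
4076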

Even if I generously allow, say, four times as much memory for bookkeeping overhead, there is no reason to run out of memory.

But there is. Each Python object has relatively substantial overhead. In CPython 3.4, for example, there is a refcount, a pointer to a type object, a couple of additional pointers linking the objects together into a doubly-linked list, and type-specific additional data. Almost all of that is overhead. Ignoring the type-specific data, just the three pointers and the refcount represent 32 bytes of overhead per object in a 64-bit build.

Strings have an additional length, hashcode, data pointer, and flags, for about 24 more bytes per object (again assuming a 64-bit build).

If your words average 6 characters then each one takes about 6 bytes in your text file, but about 68 bytes as a Python object (maybe as little as 40-ish bytes in a 32-bit Python). That doesn't count the overhead of the lists, which likely add at least 8 bytes per word and 8 more per sentence.

So yes, an expansion of a factor of 12 or more does not seem at all unlikely.
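
As a rough cross-check on the same sort of 64-bit build (sys.getsizeof reports somewhat less than the estimate above, since it does not count allocator overhead, but the order of magnitude is the same):

>>> import sys
>>> sys.getsizeof('python')                          # a 6-character word costs tens of bytes, not 6
55
>>> sys.getsizeof(['a', 'b']) - sys.getsizeof([])    # roughly 8 bytes of list slot per word
16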

Is there a memory leak that I'm overlooking?

Unlikely. Python does a pretty good job of tracking objects and collecting garbage. You do not generally see memory leaks in pure Python code.

Upvotes: 1

user1238367

Reputation: 525

Lists and strings are objects and objects have properties that take memory space. You can check the size of the objects and the overhead with sys.getsizeof:

>>> import sys
>>> sys.getsizeof('')
49
>>> sys.getsizeof('abcd')
53
>>> sys.getsizeof([])
64
>>> sys.getsizeof(['a'])
72
>>> sys.getsizeof(['a', 'b'])
80

Upvotes: 1
