user2998454

Reputation: 155

Memory consumption of python lists of lists

A code I was recently working on was found to be using around 200MB of memory to run, and I'm stumped as to why it would need that much.

Basically it mapped a text file onto a list where each character in the file was its own list containing the character and how often it has shown up so far (starting from zero) as its two items.

So 'abbac...' would be [['a','0'],['b','0'],['b','1'],['a','1'],['c','0'],...]

For a text file 1 million characters long, it used 200MB.

Is this reasonable or was it something else my code was doing? If it is reasonable, was it because of the high number of lists? Would [a,0,b,0,b,1,a,1,c,0...] take up substantially less space?
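
For reference, here is a rough sketch of how the two layouts could be measured (assuming Python 3, where the standard tracemalloc module is available; build_nested and build_flat are just placeholder names for the two layouts described above):

import random
import string
import tracemalloc

text = ''.join(random.choice(string.ascii_lowercase) for _ in range(1000000))

# Layout 1: one [char, count-so-far] inner list per character
def build_nested(s):
    counts, result = {}, []
    for char in s:
        result.append([char, counts.get(char, 0)])
        counts[char] = counts.get(char, 0) + 1
    return result

# Layout 2: a single flat list of alternating chars and counts
def build_flat(s):
    counts, result = {}, []
    for char in s:
        result.extend([char, counts.get(char, 0)])
        counts[char] = counts.get(char, 0) + 1
    return result

for builder in (build_nested, build_flat):
    tracemalloc.start()
    data = builder(text)
    print(builder.__name__, tracemalloc.get_traced_memory()[1], "bytes at peak")
    del data
    tracemalloc.stop()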

Upvotes: 5

Views: 1057

Answers (2)

ODiogoSilva

Reputation: 2414

If you do not need the list itself, then I fully subscribe to @Lattyware's solution of using a generator.

However, if that's not an option, perhaps you could compress the data in your list without loss of information by storing only the positions at which each character occurs in the file.

import random
import string

def track_char(s):
    # Make sure all characters have the same case
    s = s.lower()
    d = dict((k, []) for k in set(s))
    for position, char in enumerate(s):
        d[char].append(position)
    return d

st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))
d = track_char(st)

len(d["a"])

# Total number of occurrences of the character at position 2
for char, vals in d.items():
    if 2 in vals:
        print("Character %s has %s occurrences" % (char, len(d[char])))
Character C has 1878 occurrences

# Number of occurrences of the character at position 2, up to that position
for char, vals in d.items():
    if 2 in vals:
        print("Character %s has %s occurrences so far" % (char, len([x for x in d[char] if x <= 2])))
Character C has 1 occurrences so far

This way there is no need to store the character again for every occurrence, and you still keep the information about all of its occurrences.
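
If you need to get back to the original [char, count-so-far] pair for a given position, the positions lists make that easy, since positions are appended in increasing order and therefore stay sorted. A small sketch (pair_at is just my own helper name):

from bisect import bisect_left

def pair_at(d, s, position):
    # Rebuild the original [char, count-so-far] pair for one position
    # from the positions dict produced by track_char
    char = s[position].lower()
    # The index of `position` in the sorted positions list is exactly
    # the number of earlier occurrences of that character
    return [char, bisect_left(d[char], position)]

# e.g. pair_at(d, st, 2) gives something like ['c', 0]
#      if position 2 holds the first 'c' in the string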

To compare the object size of your original list and this approach, here's a test:

import random
import string
from sys import getsizeof

# random generation of a string with 50k characters
st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))

# Function that builds a list shaped like your original one
# (a two-item inner list per character; here the position stands in
#  for the running count, which does not change the container sizes)
def original_track(s):
    l = []
    for position, char in enumerate(s):
        l.append([char, position])
    return l

# Testing sizes
original_list = original_track(st)
dict_format = track_char(st)

getsizeof(original_list)
406496
getsizeof(dict_format)
1632

As you can see, dict_format is roughly 250 times smaller. This difference in size should be even more pronounced for larger strings.
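
Note that getsizeof only measures the outer container (the outer list of pointers versus the dict itself), not the nested lists and their contents. If you want those counted as well, a rough recursive measurement is possible; this is only a sketch (deep_getsizeof is my own helper, not part of the standard library), and with it the gap between the two layouts narrows, although the dict format should still come out smaller:

from sys import getsizeof

def deep_getsizeof(obj, seen=None):
    # Recursively sum getsizeof over containers, counting each object once
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

deep_getsizeof(original_list)
deep_getsizeof(dict_format)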

Upvotes: 2

Gareth Latty

Reputation: 89097

When it comes to memory use and lists, one of the best ways to reduce memory usage is to avoid lists altogether - Python has great support for iterators in the form of generators. If you can produce a generator instead of constructing a list, you should be able to do something like this with very little memory usage. Of course, it depends on what you are doing with the data afterwards (if, say, you are writing this structure out to a file, you could do so piece by piece rather than storing the entire thing at once).

from collections import Counter

def charactersWithCounts(data):
    # Lazily yield (character, count-so-far) pairs instead of building a list
    seen = Counter()
    for character in data:
        yield (character, seen[character])
        seen[character] += 1
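
For example (just a usage sketch with placeholder file names), you could stream the pairs straight to an output file, so only the input text and the small Counter are held in memory rather than the full list of pairs:

with open("input.txt") as src, open("counts.txt", "w") as out:
    for character, count in charactersWithCounts(src.read()):
        out.write("%s %d\n" % (character, count))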

Upvotes: 1
