didi_X8

Reputation: 5068

Python: size of strings in memory

Consider the following code:

arr = []
for (s, ident, flag) in some_data:   # renamed to avoid shadowing built-ins str and id
    arr.append((s, ident, flag))

Imagine the input strings being 2 chars long on average and 5 chars max, and some_data having 1 million elements. What would the memory requirement of such a structure be?

Could it be that a lot of memory is wasted on the strings? If so, how can I avoid that?
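Rather than estimating object-by-object, you can measure this directly. A minimal sketch with tracemalloc, using made-up sample data in place of some_data (the string lengths here are an assumption loosely matching the description):

import random
import string
import tracemalloc

def random_word():
    # 1-3 lowercase letters: short strings roughly like those described
    return ''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 3)))

tracemalloc.start()
arr = [(random_word(), i, bool(i % 2)) for i in range(1_000_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: ~{current / 1e6:.0f} MB, peak: ~{peak / 1e6:.0f} MB")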

Upvotes: 23

Views: 26292

Answers (3)

at54321

Reputation: 11756

In recent Python 3 (64-bit) versions, string instances take up 49+ bytes. But also keep in mind that if you use non-ASCII characters, the memory usage jumps up even more:

>>> import sys
>>> sys.getsizeof('t')
50
>>> sys.getsizeof('я')
76

Notice that if even one character in a string is non-ASCII, all of its characters take up more space (2 or 4 bytes each):

>>> sys.getsizeof('t12345')
55  # +5 bytes, compared to 't'
>>> sys.getsizeof('я12345')
86  # +10 bytes, compared to 'я'

This has to do with the internal representation of strings since Python 3.3. See PEP 393 -- Flexible String Representation for more details.

Python, in general, is not very memory-efficient when it comes to having lots of small objects, and not just strings. See these examples:

>>> sys.getsizeof(1)
28
>>> sys.getsizeof(True)
28
>>> sys.getsizeof([])
56
>>> sys.getsizeof(dict())
232
>>> sys.getsizeof((1,1))
56
>>> sys.getsizeof([1,1])
72

Interning strings could help, but make sure you don't have too many unique values, as that could do more harm than good.
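For instance, applied to the loop from the question (a sketch, with some_data standing in for your real input):

import sys

arr = []
for (s, ident, flag) in some_data:
    # sys.intern returns one canonical object per distinct string value,
    # so repeated values all share a single string object.
    arr.append((sys.intern(s), ident, flag))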

It's hard to say how to optimize your specific case, as there is no single universal solution. You could save a lot of memory if you somehow serialized the data from multiple items into a single byte buffer, for example, but that could complicate your code or hurt performance too much. In many cases it won't be worth it, but if I were in a situation where I really needed to optimize memory usage, I would also consider writing that part in a language like Rust (it's not too hard to create a native Python module, via PyO3 for example).
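A minimal sketch of the byte-buffer idea, assuming the strings are ASCII-only and the ids fit into a signed 32-bit integer: keep one concatenated buffer for all string bytes, plus flat array columns, instead of a million tuples.

from array import array

blob = bytearray()           # all string bytes, back to back
offsets = array('I', [0])    # offsets[i]:offsets[i+1] delimits string i
ids = array('i')             # assumption: ids fit in 32 bits
flags = bytearray()          # one byte per flag

for (s, ident, flag) in some_data:
    blob += s.encode('ascii')    # assumption: ASCII-only strings
    offsets.append(len(blob))
    ids.append(ident)
    flags.append(flag)

def get(i):
    # Rebuild the i-th (string, id, flag) triple on demand.
    return (blob[offsets[i]:offsets[i + 1]].decode('ascii'),
            ids[i], bool(flags[i]))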

Upvotes: 7

senderle

Reputation: 151097

In this case, because the strings are quite short, and there are so many of them, you stand to save a fair bit of memory by using intern on the strings (sys.intern in Python 3; the built-in intern in Python 2). Assuming the strings contain only lowercase letters, the two-character strings alone give 26 * 26 = 676 possibilities, so there must be a lot of repetition in this list; intern will ensure that those repetitions don't result in unique objects, but all refer to the same base object.

It's possible that Python already interns short strings, but looking at a number of different sources, it seems this is highly implementation-dependent. So calling intern in this case is probably the way to go; YMMV.
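A quick demonstration of what interning buys you (strings built at runtime are not automatically shared):

>>> import sys
>>> a = ''.join(['a', 'b'])   # created at runtime, so a fresh object
>>> b = ''.join(['a', 'b'])
>>> a is b
False
>>> sys.intern(a) is sys.intern(b)
True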

As an elaboration on why this is very likely to save memory, consider the following:

>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43

Adding a character to a string adds only a byte to the size of the string itself, but every string carries around 40 bytes of fixed overhead on its own. (These figures are from Python 2; in Python 3 the fixed overhead is 49+ bytes, as noted in the answer above.)
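Back-of-envelope, for the question's numbers: a million strings at roughly 40 bytes apiece is on the order of 40 MB spent on string objects alone. After interning, only a few hundred distinct string objects remain, so that cost all but disappears; each tuple then holds just the 8-byte reference it would have needed anyway.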

Upvotes: 34

Karl Barker

Reputation: 11361

If your strings are that short, it is likely there will be a significant number of duplicates. Python's interning will optimise this so that each such string is stored only once and referenced multiple times, rather than storing the string multiple times.

These strings should be automatically interned as they are.
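For what it's worth, CPython does intern identifier-like string literals at compile time, though (as noted in the answer above) this is an implementation detail, and strings built at runtime don't get this treatment:

>>> x = 'ab'
>>> y = 'ab'
>>> x is y                      # literals interned by CPython
True
>>> ''.join(['a', 'b']) is x    # runtime-built string: a new object
False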

Upvotes: 1
