acdr

Reputation: 4726

Downside to interning strings?

Consider:

a = str(123456789)
b = str(123456789)
a is b # False

The last line evaluates to False because a and b are not the same object, even though they could be (since strings are immutable). Hence, if I keep many "copies" of the same string alive, I may be using far more memory than I need. This is why intern (Py2) and sys.intern (Py3) exist!

a = intern(str(123456789)) # Py3: sys.intern (needs "import sys")
b = intern(str(123456789)) # Returns the object already in the intern table
a is b # True

Is there any downside to using intern from a memory perspective (that is, beyond the tiny time cost of calling a function)? I understand from the docs (e.g. https://docs.python.org/2/library/functions.html#intern) that a string only stays in the intern table for as long as I keep a reference to it alive. So with only one copy of the string, interning should use the same amount of memory as assigning the string directly, and with multiple copies, memory usage is obviously lower when I intern.

Upvotes: 3

Views: 408

Answers (2)

willeM_ Van Onsem

Reputation: 477318

The memory saving depends on the number of "duplicates". If there are no duplicates, interning actually consumes more memory, since Python also maintains a hash table for the lookups that interning requires (it has to check whether the string already exists).

Furthermore, string interning has two advantages: (1) faster equality checks: when both strings are interned, you can simply compare the references (as you do here with is); and (2) usually a memory reduction, since you would only intern "interesting" strings, i.e. ones you expect to be duplicated.
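
As a rough illustration of both points, here is a minimal sketch that counts how many separate objects back a list of equal strings before and after interning. It assumes CPython 3 (hence sys.intern), and distinct_objects is a hypothetical helper written for this example only.

```python
import sys

# str() builds a new object on each call, much like strings parsed from input.
raw = [str(123456789) for _ in range(1000)]
interned = [sys.intern(s) for s in raw]


def distinct_objects(strings):
    """Count how many separate objects a list of strings refers to."""
    return len({id(s) for s in strings})


print(distinct_objects(raw))        # one object per list element on CPython
print(distinct_objects(interned))   # a single shared object
print(interned[0] is interned[-1])  # identity check works once both are interned
```

The values still compare equal with == in both lists; only the number of objects holding them changes.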

Upvotes: 2

Martijn Pieters

Reputation: 1124070

There could be two downsides:

  • The CPU cost of using the sys.intern() call. Every call carries function-call overhead plus a hash-table lookup. If you do this for a lot of strings the cost adds up. It's a tradeoff of CPU cycles vs. memory you need to take into account.

  • You may end up using more memory if your strings are mostly used singly. Interning also looks up the string object in a hash table, which by necessity needs to allocate more slots than the number of strings stored. For N strings that are each used only once, and thus never deduplicated, a hash table with N entries plus its spare-capacity overhead costs extra memory without saving any.
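
To get a feel for the CPU side of that tradeoff, a quick timeit comparison along these lines can help. This is only a sketch: the workload (100,000 unique strings, the worst case for interning) and the helper names build_plain and build_interned are assumptions made for the example, and the numbers will vary by machine and Python version.

```python
import sys
import timeit

# All strings are unique, so interning cannot deduplicate anything here.
strings = [str(n) for n in range(100_000, 200_000)]


def build_plain():
    """Copy the references without interning."""
    return [s for s in strings]


def build_interned():
    """Pass every string through sys.intern()."""
    return [sys.intern(s) for s in strings]


plain_s = timeit.timeit(build_plain, number=5)
interned_s = timeit.timeit(build_interned, number=5)

print(f"plain:    {plain_s:.4f} s")
print(f"interned: {interned_s:.4f} s")
```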

That said, we've used interning successfully and to significant effect in a multi-gigabyte in-memory cache, where strings by necessity appear in multiple locations in a tree structure.
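
For the opposite case, where the same values recur many times, the saving can be measured with tracemalloc. The sketch below is not our cache; it is a toy workload (200,000 keys drawn from 100 distinct values) with hypothetical helpers traced_bytes and keys, assuming CPython 3.

```python
import sys
import tracemalloc


def traced_bytes(build):
    """Return the net bytes allocated by build() while its result is alive."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    data = build()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del data
    return after - before


def keys(n=200_000, distinct=100):
    """Yield n freshly built strings that take only `distinct` different values."""
    return (str(10**8 + i % distinct) for i in range(n))


plain_bytes = traced_bytes(lambda: list(keys()))
interned_bytes = traced_bytes(lambda: [sys.intern(k) for k in keys()])

print(f"plain:    {plain_bytes} bytes")
print(f"interned: {interned_bytes} bytes")
```

Both lists hold the same number of references, so the difference comes from how many string objects stay alive behind them.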

Upvotes: 3
