behnam

Reputation: 1979

In Python runtime, is there a way to distinguish literal string instances from dynamically created ones?

For example, I want to be able to tell the difference between these two values:

var1 = "Foo"
var2 = "%s" % "Foo"

An example use case for this check is protecting a string.Template-like function from attacks such as exposing the values of local variables.
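To make the concern concrete, here is a minimal sketch of the kind of leak I have in mind. The render() helper and its variables are hypothetical, invented only for illustration; the point is that a template string coming from an untrusted source can name any variable the function happens to expose.

```python
def render(template):
    # Hypothetical helper: interpolates whatever local variables exist here.
    user = "alice"
    secret = "hunter2"  # never meant to be shown to anyone
    return template.format(**locals())


# Intended use: the template is a literal written by the developer.
print(render("Hello {user}"))

# Attack: the template is supplied from outside and asks for a different name.
print(render("{secret}"))
```

If render() could verify that its argument was a literal in the source code, the second call could be rejected.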

If it's not possible, is there any good reason for it?


And a side note...

PEP 498 -- Literal String Interpolation introduces f-strings, which are string literals that are split into literal parts and expressions at tokenization time.

F-strings work much like string.Template(), but they enforce that the input is a literal string, at the cost of a syntax change to the language.

If this kind of check were available at runtime, f-strings could have been implemented as a function.


Update 1

As noted by @kevin in his answer, CPython has optimizations that allow it to reuse existing instances when there is no need to create new ones. In my first example, "%s" % "Foo" is never evaluated at run time; var2 simply refers to the existing "Foo" instance.

But that's not a language requirement, and in fact it doesn't always happen. String formatting beyond a few trivial cases results in a new instance.

In the following example, you can see that although the strings are equal by value, they are not the same object. Using sys.intern() would give us the same instance, though.

In [1]: import dis
   ...: import sys
   ...:
   ...: def foo():
   ...:     var1 = "Foo Bar"
   ...:     var2 = "%s %s" % ("Foo", "Bar")
   ...:     print(f'plain eq: {var1 == var2}')
   ...:     print(f'plain is: {var1 is var2}')
   ...:     print(f'intern is: {sys.intern(var1) is sys.intern(var2)}')
   ...:
   ...: dis.dis(foo)
   ...: foo()
   ...:
  5           0 LOAD_CONST               1 ('Foo Bar')
              2 STORE_FAST               0 (var1)

  6           4 LOAD_CONST               9 ('Foo Bar')
              6 STORE_FAST               1 (var2)

  7           8 LOAD_GLOBAL              0 (print)
             10 LOAD_CONST               5 ('plain eq: ')
             12 LOAD_FAST                0 (var1)
             14 LOAD_FAST                1 (var2)
             16 COMPARE_OP               2 (==)
             18 FORMAT_VALUE             0
             20 BUILD_STRING             2
             22 CALL_FUNCTION            1
             24 POP_TOP

  8          26 LOAD_GLOBAL              0 (print)
             28 LOAD_CONST               6 ('plain is: ')
             30 LOAD_FAST                0 (var1)
             32 LOAD_FAST                1 (var2)
             34 COMPARE_OP               8 (is)
             36 FORMAT_VALUE             0
             38 BUILD_STRING             2
             40 CALL_FUNCTION            1
             42 POP_TOP

  9          44 LOAD_GLOBAL              0 (print)
             46 LOAD_CONST               7 ('intern is: ')
             48 LOAD_GLOBAL              1 (sys)
             50 LOAD_ATTR                2 (intern)
             52 LOAD_FAST                0 (var1)
             54 CALL_FUNCTION            1
             56 LOAD_GLOBAL              1 (sys)
             58 LOAD_ATTR                2 (intern)
             60 LOAD_FAST                1 (var2)
             62 CALL_FUNCTION            1
             64 COMPARE_OP               8 (is)
             66 FORMAT_VALUE             0
             68 BUILD_STRING             2
             70 CALL_FUNCTION            1
             72 POP_TOP
             74 LOAD_CONST               0 (None)
             76 RETURN_VALUE
plain eq: True
plain is: False
intern is: True

As the documentation for sys.intern() says, "Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys." In other words, strings created at run time are normally not interned.

Upvotes: 3

Views: 171

Answers (1)

Kevin

Reputation: 30161

No, you can't do that. Or at least, you can't do it at runtime. If you're willing to accept the limitations of compile-time analysis, you can parse and examine Python code with ast, but that is probably a far more involved tool than what you are looking for, and certainly will not allow you to "implement f-strings as a function."
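For what it's worth, here is a rough sketch of what the compile-time route could look like. It assumes Python 3.8+ (where string literals parse to ast.Constant), and both the literal_string_args() helper and the render() name it looks for are made up for this example. It only inspects source text, so it says nothing about values at run time.

```python
import ast


def literal_string_args(source, func_name):
    """Return (lineno, is_literal) for the first positional argument
    of every call to func_name found in the given source text."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == func_name
                and node.args):
            first = node.args[0]
            # A plain string literal parses to ast.Constant holding a str;
            # "%s" % x is an ast.BinOp and a bare variable is an ast.Name.
            is_literal = (isinstance(first, ast.Constant)
                          and isinstance(first.value, str))
            results.append((node.lineno, is_literal))
    return sorted(results)


# Hypothetical source to analyse; it is parsed, never executed.
SOURCE = '''
render("Hello {name}")
render("%s" % user_input)
render(template)
'''

print(literal_string_args(SOURCE, "render"))
```

Only the first call is reported as taking a literal. A check like this has to run over the caller's source (in a linter or an import hook, say), which is why it cannot stand in for a runtime test inside the function being called.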

For the specific case of your example, the Python language specification permits var1 and var2 to both point to the same object (and they definitely will if you pass both of them through the sys.intern() function and compare the results). Since a conforming Python implementation could alias them, there is no reliable way to tell them apart. In fact, when I tried it in CPython 3.6.1, they were aliased:

import dis

def foo():
    var1 = "Foo"
    var2 = "%s" % "Foo"
    return var1 is var2

dis.dis(foo)
print(foo())

Output:

  4           0 LOAD_CONST               1 ('Foo')
              2 STORE_FAST               0 (var1)

  5           4 LOAD_CONST               3 ('Foo')
              6 STORE_FAST               1 (var2)

  6           8 LOAD_FAST                0 (var1)
             10 LOAD_FAST                1 (var2)
             12 COMPARE_OP               8 (is)
             14 RETURN_VALUE
True

Notice that it didn't even waste time computing var2. It got constant-folded into the literal value 'Foo', which was then deduplicated with the other 'Foo' which the function was already using for var1.

(A more aggressive optimizer could have then propagated those constants and converted var1 is var2 into True, but CPython does not do that (yet?), probably because it is rare to use is on immutable values like strings. Most of the other operations which could plausibly benefit from constant propagation are subject to various kinds of monkey patching, which prevents this optimization in the vast majority of real-world use cases. As such, I presume that it is not worth implementing.)

"If it's not possible, is there any good reason for it?"

Because Python, like most imperative languages, uses eager evaluation, which throws this information away immediately. With a lazy-evaluated language, this question would at least be reasonable to ask, but I don't believe most of them preserve this information either. The question of whether a string is literal or non-literal simply isn't considered a part of the string's value, in most programming languages that deal with strings.
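A small sketch of what that means in practice: everything observable about the two strings is the same. Whether the two names already refer to one object before interning is an implementation detail, so the sketch deliberately does not test that.

```python
import sys

literal = "Foo Bar"                 # written as a literal in the source
built = " ".join(["Foo", "Bar"])    # assembled at run time

# Same value, same hash, same type: nothing records how each was made.
print(literal == built)
print(hash(literal) == hash(built))
print(type(literal) is type(built) is str)

# After interning, both names refer to the very same object.
print(sys.intern(literal) is sys.intern(built))
```

All four lines print True, so any code that receives one of these strings has nothing to inspect that would reveal its origin.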

Upvotes: 3
