mix

Reputation: 7151

Why is escaping of single quotes inconsistent on file read in Python?

Given two nearly identical text files (plain text, created in MacVim), I get different results when reading them into a variable in Python. I want to know why this is and how I can produce consistent behavior.

For example, f1.txt looks like this:

This isn't a great example, but it works.

And f2.txt looks like this:

This isn't a great example, but it wasn't meant to be. 
"But doesn't it demonstrate the problem?," she said.

When I read these files in, using something like the following:

f = open("f1.txt","r")
x = f.read()

I get the following when I look at the variables in the console. f1.txt:

>>> x
"This isn't a great example, but it works.\n\n"

And f2.txt:

>>> y
'This isn\'t a great example, but it wasn\'t meant to be. \n"But doesn\'t it demonstrate the problem?," she said.\n\n'

In other words, f1 comes in with only escaped newlines, while f2 also has its single quotes escaped.

repr() shows what's going on. First, for f1:

>>> repr(x)
'"This isn\'t a great example, but it works.\\n\\n"'

And f2:

>>> repr(y)
'\'This isn\\\'t a great example, but it wasn\\\'t meant to be. \\n"But doesn\\\'t it demonstrate the problem?," she said.\\n\\n\''

This kind of behavior is driving me crazy. What's going on, and how do I make it consistent? If it matters, I'm trying to read in plain text, manipulate it, and eventually write it out so that it shows the properly escaped characters (for pasting into JavaScript code).

Upvotes: 5

Views: 3953

Answers (2)

kindall

Reputation: 184211

Python is giving you a string literal which, if you gave it back to Python, would result in the same string. This is known as the repr() (short for "representation") of the string. This may not (probably won't, in fact) match the string as it was originally specified, since there are so many ways to do that, and Python does not record anything about how it was originally specified.

It uses double quotes around your first example because that string contains single quotes but no double quotes, so nothing inside it needs escaping. The second string contains both kinds of quote, so no choice of delimiter avoids escaping entirely. In that case Python falls back to its default, single quotes, and uses backslashes to escape the single quotes in the string (the double quotes need no escaping this way). The result is not necessarily the shortest possible representation, only a valid one.

There is no reason for this behavior to drive you crazy and no need to try to make it consistent. You only get the repr() of a string when you are peeking at values in Python's interactive mode. When you actually print or otherwise use the string, you get the string itself, not a reconstituted string literal.
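A minimal sketch of that difference, assuming the two files' contents are written inline (so the example runs without f1.txt and f2.txt) and reusing the names x and y from the question:

```python
# Contents of f1.txt and f2.txt from the question, written inline.
x = "This isn't a great example, but it works.\n\n"
y = ('This isn\'t a great example, but it wasn\'t meant to be. \n'
     '"But doesn\'t it demonstrate the problem?," she said.\n\n')

# repr() builds a Python literal, so its quoting depends on the content.
print(repr(x))
print(repr(y))

# print() writes the string itself: no delimiters, no backslashes.
print(x)
print(y)

# The stored text holds plain apostrophes and no backslashes at all.
has_backslash = "\\" in x or "\\" in y
print(has_backslash)  # False
```

The backslashes only ever exist in the literal that repr() generates, never in the data read from the file.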

If you want to get a JavaScript string literal, the easiest way is to use the json module:

import json
print(json.dumps('I said, "Hello, world!"'))
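Applied to the text of f2.txt, this might look like the following sketch. The string is written inline rather than read from the file, and the name js_literal is only illustrative:

```python
import json

# Contents of f2.txt from the question, written inline.
y = ('This isn\'t a great example, but it wasn\'t meant to be. \n'
     '"But doesn\'t it demonstrate the problem?," she said.\n\n')

# json.dumps always delimits with double quotes, escapes the double
# quotes and newlines inside, and leaves apostrophes untouched.
js_literal = json.dumps(y)
print(js_literal)

# Decoding the literal gives back exactly the original text.
round_trip_ok = json.loads(js_literal) == y
print(round_trip_ok)  # True
```

Unlike repr(), the quoting here does not change with the content, which is the consistency the question is after.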

Upvotes: 16

abarnert

Reputation: 365767

Both f1 and f2 contain perfectly normal, unescaped single quotes.

The fact that their repr looks different is meaningless.

There are a variety of different ways to represent the same string. For example, these are all equivalent literals:

"abc'def'ghi"
'abc\'def\'ghi'
'''abc'def'ghi'''
r"abc'def'ghi"

The repr function on a string always just generates some literal that is a valid representation of that string, but you shouldn't depend on exactly which one it generates. (In fact, you should rarely use it for anything but debugging purposes in the first place.)
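A quick check of both points, as a sketch (the variable names a through d are only for illustration):

```python
import ast

# Four different literals for the same 11-character string.
a = "abc'def'ghi"
b = 'abc\'def\'ghi'
c = '''abc'def'ghi'''
d = r"abc'def'ghi"

all_equal = a == b == c == d
print(all_equal)  # True

# Whatever literal repr() picks, evaluating it yields the original string.
literal = repr(a)
print(literal)
recovered = ast.literal_eval(literal)
print(recovered == a)  # True
```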


Since the language doesn't define anywhere what algorithm it uses to generate a repr, it could be different for each version of each implementation.

Most of them will try to be clever, using single or double quotes to avoid as many escaped internal quotes as possible, but even that isn't guaranteed. If you really want to know the algorithm for a particular implementation and version, you pretty much have to look at the source. For example, in CPython 3.3, inside unicode_repr, it counts the number of quotes of each type; then if there are single quotes but no double quotes, it uses " instead of '.
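The quote-selection rule described above can be sketched in a few lines. This is a simplified model of the behavior just described, not CPython's actual code; choose_quote is a hypothetical helper, and the comparison against repr() is only expected to hold on CPython:

```python
def choose_quote(s):
    """Return the delimiter this sketch assumes repr() will pick."""
    # Single quotes present and no double quotes: switch to double quotes.
    if "'" in s and '"' not in s:
        return '"'
    # Otherwise keep the default single-quote delimiter.
    return "'"

samples = [
    "This isn't a great example, but it works.\n\n",  # only single quotes
    'She said "hi".',                                  # only double quotes
    'Both \' and " appear here',                       # both kinds
    "plain",                                           # neither
]

# Compare the sketch's choice with the first character of the real repr.
for s in samples:
    print(choose_quote(s), repr(s)[0])
```

Note that the rule is not a comparison of counts: any string containing both kinds of quote gets single-quote delimiters, however many of each it holds.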


If you want "the" representation of a string, you're out of luck, because there is no such thing. But if you want some particular representation of a string, that's no problem. You just have to know what format you want; for most formats, someone has already written the code, and often it's in the standard library. You can make C literal strings, JSON-encoded strings, strings that can fit into ASCII RFC 822 headers… But all of those formats have different rules from each other (and from Python literals), so you have to use the right function for the job.

Upvotes: 7
