Adam Matan

Reputation: 136191

Python string formatting + UTF-8 strange behaviour

When printing a formatted string with a fixed field width (e.g., %20s), the padding differs between a UTF-8 string and a plain ASCII string:

>>> str1="Adam Matan"
>>> str2="אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X        אדם מתן X

Note the difference:

X           Adam Matan X
X        אדם מתן X

Any ideas?

Upvotes: 3

Views: 2315

Answers (3)

user355252

In Python 2, unprefixed string literals are of type str, which is a byte string. It stores arbitrary bytes, not characters. UTF-8 encodes some characters with more than one byte, so str2 contains more bytes than characters. That explains the unexpected but perfectly valid behaviour in string formatting: the width in %20s is counted in bytes. If you look at the actual byte content of the formatted strings (use repr instead of print), you'll see that in both cases the field is exactly 20 bytes (not characters!) long.
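The byte counting described above can be checked with a small sketch. It is written to run on both Python 2.7 and Python 3.5+, so it uses an explicit b"..." format string, which is the same type as a Python 2 str; the variable names (text, raw, field) are illustrative and not from the question.

```python
# -*- coding: utf-8 -*-
# Sketch only: b"..." stands in for the unprefixed Python 2 str literal.
text = u"אדם מתן"            # 7 characters: 6 Hebrew letters and 1 space
raw = text.encode("utf-8")   # the bytes a UTF-8 encoded Python 2 literal holds

# Each Hebrew letter takes 2 bytes in UTF-8; the space takes 1.
print("%d characters, %d bytes" % (len(text), len(raw)))

# %20s applied to a byte string pads the field to 20 bytes.
field = b"%20s" % raw
padding = len(field) - len(raw)
print(repr(field))
print("%d bytes in field, %d of them padding" % (len(field), padding))
```

The string has 7 characters but 13 bytes, so only 7 padding spaces are added instead of 13, which matches the shorter gap in the question's output.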

As already mentioned, the solution is to use unicode strings. When working with strings in Python, you absolutely need to understand the difference between unicode strings and byte strings.

Upvotes: 3

tghw

Reputation: 25303

You need to mark the second string as Unicode by putting u in front of the literal:

>>> str1="Adam Matan"
>>> str2=u"אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X              אדם מתן X

With a unicode operand, the result of the formatting is a unicode string, and Python counts the field width in Unicode characters rather than bytes.
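As a rough check of that claim, the following sketch formats both names as unicode and compares lengths. It assumes Python 2.7 or Python 3.3+ (where the u prefix is accepted), and the names list is only an illustration. It prints counts rather than the lines themselves, so it does not depend on the terminal encoding.

```python
# -*- coding: utf-8 -*-
names = [u"Adam Matan", u"אדם מתן"]

# Unicode format string and unicode values: width is counted in characters.
lines = [u"X %20s X" % name for name in names]

for name, line in zip(names, lines):
    # Every line is 2 + 20 + 2 = 24 characters, whatever the value's length.
    print("%d characters in value, %d in formatted line" % (len(name), len(line)))

# Encode only at the output boundary; the byte lengths differ again here.
encoded = [line.encode("utf-8") for line in lines]
print("encoded sizes in bytes: %d and %d" % (len(encoded[0]), len(encoded[1])))
```

Both formatted lines have the same number of characters, while their UTF-8 encodings differ in size, which is why the padding has to be applied before encoding.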

Upvotes: 7

Michał Kwiatkowski

Reputation: 9764

Try decoding the byte string into a unicode object before formatting:

>>> str1="Adam Matan"
>>> str2=unicode("אדם מתן", "utf8")
>>> print "X %20s X" % str2
X              אדם מתן X
>>> print "X %20s X" % str1
X           Adam Matan X
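When the value may arrive either as bytes or as text, the decoding step can be wrapped in a small function. This is only a sketch: format_field and its width and encoding parameters are hypothetical names, and it assumes byte input is UTF-8. It uses .decode(), which is equivalent to unicode(value, "utf8") on Python 2 and also works on Python 3.

```python
# -*- coding: utf-8 -*-
def format_field(value, width=20, encoding="utf-8"):
    """Right-align value in a field of `width` characters (hypothetical helper)."""
    # On Python 2, bytes is an alias of str, so plain literals are decoded too.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    # %*s takes the field width from the argument tuple.
    return u"X %*s X" % (width, value)


ascii_line = format_field(b"Adam Matan")
hebrew_line = format_field(u"אדם מתן".encode("utf-8"))
print("%d and %d characters" % (len(ascii_line), len(hebrew_line)))
```

Decoding at the point where data enters the program keeps the formatting code working on characters only.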

Upvotes: 1
