Reputation: 136191
When printing a formatted string with a fixed width (e.g., %20s), the padding differs between a UTF-8 string and a plain ASCII string:
>>> str1="Adam Matan"
>>> str2="אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X        אדם מתן X
Note the difference:
X           Adam Matan X
X        אדם מתן X
Any ideas?
Upvotes: 3
Views: 2315
Reputation:
In Python 2, unprefixed string literals are of type str, which is a byte string. It stores arbitrary bytes, not characters. UTF-8 encodes some characters with more than one byte, so str2 contains more bytes than characters. That explains the unexpected but perfectly valid behaviour in string formatting. If you look at the actual byte content of these strings (use repr instead of print), you'll see that in both strings the field is actually 20 bytes (not characters!) long.
As already mentioned, the solution is to use unicode strings. When working with strings in Python, you absolutely need to understand and realize the difference between unicode and byte strings.
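To make the byte/character difference concrete, here is a minimal sketch that compares the two lengths for the Hebrew string from the question. It is written to run on both Python 2 and Python 3 (the bytes formatting line needs Python 2 or Python 3.5+); the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
text = u"אדם מתן"               # unicode string: 7 characters
encoded = text.encode("utf-8")  # byte string: each Hebrew letter takes 2 bytes

print(len(text))     # 7
print(len(encoded))  # 13

# Formatting the byte string pads the field to 20 *bytes*,
# so only 7 spaces of padding are added instead of 13.
padded_bytes = b"X %20s X" % encoded
print(len(padded_bytes))                  # 24 bytes in total
print(len(padded_bytes.decode("utf-8")))  # but only 18 characters
```

The six missing characters of padding are exactly the six extra bytes that UTF-8 needs for the six Hebrew letters.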
Upvotes: 3
Reputation: 25303
You need to specify that the second string is Unicode by putting u in front of the string literal:
>>> str1="Adam Matan"
>>> str2=u"אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X              אדם מתן X
With a unicode string, Python counts the field width in characters rather than bytes, so both lines are padded to the same length.
Upvotes: 7
Reputation: 9764
Try this way:
>>> str1="Adam Matan"
>>> str2=unicode("אדם מתן", "utf8")
>>> print "X %20s X" % str2
X              אדם מתן X
>>> print "X %20s X" % str1
X           Adam Matan X
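The unicode(..., "utf8") call above decodes bytes into text. When the bytes come from somewhere else (a file, a socket), the same step can be written with .decode() before formatting. This is a rough sketch that runs on both Python 2 and Python 3; raw is a stand-in for whatever byte string you actually receive.

```python
# -*- coding: utf-8 -*-
# `raw` stands in for UTF-8 bytes read from a file or socket (an assumption
# for this example; here we build it from a literal).
raw = u"אדם מתן".encode("utf-8")

text = raw.decode("utf-8")   # bytes -> unicode text, 7 characters
line = u"X %20s X" % text    # width is now counted in characters

print(len(line))  # 24: "X " + 20-character field + " X"
```

Decoding once at the input boundary and formatting only unicode strings keeps the width calculation consistent for ASCII and non-ASCII text alike.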
Upvotes: 1