Reputation: 343
Can someone explain why the middle code excerpt throws an error in Python 2.7.x?
import re
walden = "Waldenström"
walden
print(walden)
s1 = "ö"
s2 = "Wal"
s3 = "OOOOO"
out = re.sub(s1, s3, walden)
print(out)
out = re.sub("W", "w", walden)
print(out)
# I need this one to work, but it raises an ERROR
out = re.sub('W', u'w', walden)
out = re.sub(u'W', 'w', walden)
print(out)
out = re.sub(s2, s1, walden)
print(out)
I'm very confused and have tried reading the manual.
Upvotes: 1
Views: 35
Reputation: 19352
walden is a str:
walden = "Waldenström"
This code replaces a character with a unicode string:
re.sub('W', u'w', walden)
The result of that should be u'w' + "aldenström". This is the part that fails.
In order to concatenate str and unicode, both first have to be converted to unicode. The result is unicode as well.
The problem is that the interpreter does not know which encoding the bytes of 'ö' are in. For the implicit conversion it falls back to the default ASCII codec, which cannot decode non-ASCII bytes, so it raises a UnicodeDecodeError.
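The failing implicit conversion can be reproduced by doing the decode by hand. This is only a minimal sketch: it assumes the text was saved as UTF-8 (so 'ö' is the two bytes \xc3\xb6), and the names raw and error_name are made up for illustration.

```python
# Bytes that the str "Waldenström" holds when the text is UTF-8 (an assumption).
raw = b"Waldenstr\xc3\xb6m"

# Mixing str and unicode makes Python 2 decode the str with the ASCII codec.
# Decoding explicitly with "ascii" fails on the same non-ASCII bytes.
try:
    raw.decode("ascii")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

print(error_name)  # UnicodeDecodeError
```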
The solution is to convert yourself before doing the replacement:
re.sub('W', u'w', unicode(walden, encoding))
The encoding should be the one you used to save that file, e.g.
re.sub('W', u'w', unicode(walden, 'utf-8'))
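Put together, the fix might look like the sketch below. It again assumes UTF-8; raw.decode('utf-8') does the same job as unicode(raw, 'utf-8'), and the names raw, text, and encoded are illustrative only.

```python
import re

# Assumed UTF-8 bytes for "Waldenström", as in a Python 2 str.
raw = b"Waldenstr\xc3\xb6m"

# Decode explicitly first, so every argument to re.sub is unicode.
text = raw.decode("utf-8")
out = re.sub(u"W", u"w", text)

# Encode again only when bytes are needed, e.g. for writing to a file.
encoded = out.encode("utf-8")
```

Keeping everything unicode inside the program and encoding only at the boundaries avoids the implicit ASCII conversion altogether.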
Upvotes: 2