Reputation: 1239
I want to split the string 'I have £300', but it seems that the split function first converts it to ASCII, and afterwards I can't convert it back to the same Unicode string it was before.
Is there any other way to split such a Unicode string without breaking it, as happens in the snippet below?
# -*- coding: utf-8 -*-
mystring = 'I have £300.'
alist = mystring.split()
alist = [item.decode("utf-8") for item in alist]
print "alist",alist
print "mystring.split()",mystring.split()
#I want to get [I,have,£300]
#I get: ['I', 'have', '\xc2\xa3300.']
Upvotes: 1
Views: 1197
Reputation: 50200
The problem is not with split(). The real problem is that the handling of Unicode in Python 2 is confusing.
The first assignment in your code produces a string, i.e. a sequence of bytes, which contains the UTF-8 encoding of the symbol £. You can confirm this by displaying the repr of your original string:
>>> mystring
'I have \xc2\xa3300.'
The rest of the statements just do what you would expect them to with such input. If you want to work with unicode, create a unicode string to start with:
>>> mystring = u'I have £300.'
A far better solution, however, is to switch to Python 3. Wrapping your head around the semantics of Unicode in Python 2 is not worth the effort when there's such a superior alternative.
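To make the byte string versus text string distinction concrete, here is a minimal sketch written for Python 3, where the byte string has to be spelled out explicitly. The variable names are illustrative only, and the encode step assumes the question's source file was saved as UTF-8, as its coding line declares:

```python
# -*- coding: utf-8 -*-
# Sketch only: models the Python 2 situation using Python 3 types.
text = u'I have £300.'        # a text (Unicode) string
raw = text.encode('utf-8')    # assumed equivalent of the plain '...' literal in Python 2

raw_words = raw.split()       # splitting bytes yields bytes
text_words = text.split()     # splitting text yields text

# Decoding each byte chunk recovers the same pieces of text,
# so split() itself has not damaged anything.
decoded_words = [word.decode('utf-8') for word in raw_words]

print(raw_words)
print(text_words)
print(decoded_words == text_words)
```

The split result carries the same characters either way; only the type of the pieces (bytes or text) differs.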
Upvotes: 1
Reputation: 113864
You are looking at a limitation of the way Python 2 displays data.
Using Python 2:
>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '\xc2\xa3300.']
But observe that the element itself prints as you want:
>>> print(mystring.split()[2])
£300.
Using Python 3, by contrast, the list displays as you would like:
>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '£300.']
A major reason to use Python 3 is its superior handling of Unicode.
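The difference between how a list is displayed and how one element prints can be sketched in Python 3 as follows. This is an illustration under assumptions, not a reproduction of Python 2: the built-in ascii() is used here only to imitate an escaped listing, and it escapes the code point (\xa3) rather than the UTF-8 bytes (\xc2\xa3) seen in the Python 2 output above.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): a list is displayed via the repr of each element,
# while printing a single element shows the text itself.
words = u'I have £300.'.split()

escaped_view = ascii(words)   # non-ASCII characters shown as escapes
plain_view = repr(words)      # Python 3 repr keeps printable non-ASCII characters

print(escaped_view)
print(plain_view)
print(words[2])               # the element itself, printed as text
```

In both views the list holds the same data; only the way it is rendered changes.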
Upvotes: 3