Brana

Reputation: 1239

Cannot split a unicode string without converting to ascii - python 2.7

I want to split the string "I have £300", but it seems that the split function first converts it to ASCII. Afterwards, I can't convert it back to the same unicode string it was before.

Is there any other way to split such a unicode string without breaking it, as happens in the snippet below?

# -*- coding: utf-8 -*-
mystring = 'I have £300.'
alist = mystring.split()
alist = [item.decode("utf-8") for item in alist]
print "alist",alist
print "mystring.split()",mystring.split()

#I want to get [I,have,£300]
#I get: ['I', 'have', '\xc2\xa3300.']

Upvotes: 1

Views: 1197

Answers (2)

alexis

Reputation: 50200

The problem is not with split(). The real problem is that the handling of unicode in python 2 is confusing.

The first line in your code produces a string, i.e. a sequence of bytes, which contains the utf-8 encoding of the symbol £. You can confirm this by displaying the repr of your original string:

>>> mystring
'I have \xc2\xa3300.'
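As a small check of where the two bytes \xc2\xa3 come from, the sketch below encodes the pound sign on its own. The variable names are only illustrative, and the code is written so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative names; not part of the original question.
text = u'£'                      # one character of text
encoded = text.encode('utf-8')   # its UTF-8 representation as bytes

assert len(text) == 1            # a single code point
assert len(encoded) == 2         # stored as two bytes in UTF-8
assert encoded == b'\xc2\xa3'    # the bytes seen in the repr above
assert encoded.decode('utf-8') == text   # decoding restores the character
```

So nothing was converted to ASCII: the plain string literal already held these two bytes, and the repr merely shows them in escaped form.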

The rest of the statements just do what you would expect them to with such input. If you want to work with unicode, create a unicode string to start with:

>>> mystring = u'I have £300.'
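A minimal sketch of the split on a unicode string, assuming the source file is saved as UTF-8 with the coding declaration from the question. It avoids printing the whole list, since in Python 2 that would show the escaped repr of each item.

```python
# -*- coding: utf-8 -*-
mystring = u'I have £300.'   # the u prefix makes this text, not bytes
alist = mystring.split()

# The items are unicode strings, and the pound sign is a single character.
assert alist == [u'I', u'have', u'£300.']
assert len(alist[2]) == 5

# Printing one item (rather than the list) shows the character itself,
# provided the terminal encoding can represent it.
print(alist[2])
```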

A far better solution, however, is to switch to Python 3. Wrapping your head around the semantics of unicode in python 2 is not worth the effort when there's such a superior alternative.

Upvotes: 1

John1024

Reputation: 113864

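You are looking at a limitation of the way Python 2 displays data: showing a list displays the repr() of each item, and the repr of a byte string escapes every non-ASCII byte.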

Using python 2:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '\xc2\xa3300.']

But observe that an individual item prints as you want:

>>> print(mystring.split()[2])
£300.
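If the goal is to display all the words at once, one option is to decode the pieces and join them into a single string instead of printing the list. This is only a sketch: the names raw, words and joined are made up for illustration, and the byte literal stands in for the string from the question.

```python
# -*- coding: utf-8 -*-
raw = b'I have \xc2\xa3300.'     # UTF-8 bytes, as shown in the repr above

# Split the bytes on whitespace, then decode each piece to text.
words = [w.decode('utf-8') for w in raw.split()]

# Joining yields one text string, so print shows characters, not escapes
# (assuming the terminal encoding can represent the pound sign).
joined = u', '.join(words)
print(joined)
```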

Using python 3, by contrast, it displays as you would like:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '£300.']

A major reason to use python 3 is its superior handling of unicode.

Upvotes: 3
