mel

Reputation: 2790

Encoding python utf-8 splitting

I've got a variable:

age_expectations = dictionary['looking_for']['age']
print type(age_expectations), age_expectations

The output is:

<type 'unicode'> 22‑35

When I try to split it on the dash, I run into the following problem:

res = age_expectations.split('-')
print res

And the output looks like:

[u'22\u201135']

Instead of:

["22", "35"]

What is the problem? I've tried many combinations of encoding and decoding, but I don't really understand how they work. Does the problem come from the split?

Upvotes: 2

Views: 1658

Answers (2)

bruno desthuilliers

Reputation: 77912

As you can see from your output, the hyphen in your age_expectations variable is the Unicode character U+2011 (non-breaking hyphen), not the standard "-" (U+002D, hyphen-minus). You would have seen it from the start if you had printed the variable's representation instead:

>>> uu = u"22\u201135"
>>> print uu
22‑35
>>> print repr(uu)
u'22\u201135'
>>> 
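
If the escape alone doesn't tell you much, one way to identify the character is to look up its official name. This is only a minimal sketch using the standard unicodedata module; the variable uu is the same sample string as above, and the name described is just for illustration. It is written so that it should run on both Python 2 and Python 3:

```python
import unicodedata

uu = u"22\u201135"

# Pair each character's code point with its official Unicode name.
described = [(u"U+%04X" % ord(ch), unicodedata.name(ch, u"UNKNOWN")) for ch in uu]

for code_point, name in described:
    print("%s %s" % (code_point, name))
```

The third line printed is the interesting one: it names U+2011 as NON-BREAKING HYPHEN, which is why splitting on "-" finds nothing.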

So you need to either replace the u"\u2011" character with a plain hyphen before splitting (if your data may contain either character) or simply split the string on u"\u2011" (if you're sure you'll always get this as the delimiter).
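
If you can't be sure which dash you'll get, a variation on the first option is to split on a whole set of dash-like characters at once. The following is only a sketch: the helper name split_age_range and the particular list of dashes are my own assumptions, so adjust the list to whatever actually shows up in your data.

```python
import re

# Assumption: the set of dash-like characters worth handling.
# "-" hyphen-minus (U+002D), U+2010 hyphen, U+2011 non-breaking hyphen,
# U+2012 figure dash, U+2013 en dash, U+2014 em dash, U+2212 minus sign.
DASHES = re.compile(u"[-\u2010\u2011\u2012\u2013\u2014\u2212]")


def split_age_range(text):
    """Split text on any of the dash characters above and trim whitespace."""
    return [part.strip() for part in DASHES.split(text)]


res = split_age_range(u"22\u201135")
print(res)
```

The result is a list with the two parts, 22 and 35, and the same call also handles a plain u"22-35". Note that every part is still a unicode string, so call int() on each one if you need numbers.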

Upvotes: 1

Praveen

Reputation: 9345

Split the unicode string on the matching unicode character, like this:

>>> u_code = u'\u0032\u0032\u2011\u0033\u0035'
>>> print u_code
22‑35
>>> u_code.split('-')
[u'22\u201135']
>>> u_code.split(u'\u2011')
[u'22', u'35']
>>>

Upvotes: 2
