user3662991

Reputation: 1133

Converting utf-8 encoded string to just plain text in python 3

So I've been getting all caught up in Unicode and UTF-8, as I have a script which grabs images and their titles off the web. It works great, except when a title has special characters (e.g. Jökulsárlón).

It comes out as unicode:

J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n

So I want a way to turn that string into plain text, whether that means turning the characters into the nearest 'normal' letters (like plain o instead of ö) or printing the actual symbols (rather than \xc3 etc.). I've tried a billion different ways, but a lot of the things I've been reading haven't worked for me in Python 3.

Thanks in advance

Upvotes: 2

Views: 9901

Answers (3)

Mark Tolonen

Reputation: 177481

If your string is <class 'str'> and it prints literally J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n, then the last line below will decode it:

>>> s='J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> type(s)
<class 'str'>
>>> s
'J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> s.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
'Jökulsárlón'

How it got that convoluted is unknown. If this isn't the solution, then update your question with the type of the variable holding the string (type(s) for example) and the exact value as shown above for my example.
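For anyone puzzled by the four chained calls, here is the same conversion split into named steps. This is only a sketch: the function name unescape_utf8 is made up for illustration, and it assumes the input really is a str containing literal backslash escapes of UTF-8 bytes, as in the example above.

```python
def unescape_utf8(s: str) -> str:
    """Turn a str holding literal \\xNN escapes of UTF-8 bytes into real text.

    Hypothetical helper, not part of any library.
    """
    # Step 1: str -> bytes. latin1 maps code points 0-255 straight to byte values.
    raw = s.encode('latin1')
    # Step 2: interpret the literal backslash escapes. The result is a str
    # whose characters are U+00C3, U+00B6, ... (one per original byte).
    unescaped = raw.decode('unicode_escape')
    # Step 3: str -> bytes again, recovering the actual UTF-8 byte sequence.
    utf8_bytes = unescaped.encode('latin1')
    # Step 4: decode those bytes as UTF-8 to get the intended characters.
    return utf8_bytes.decode('utf8')


# The doubled backslashes make this a str with literal backslash characters.
title = unescape_utf8('J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n')
```

If the value turns out to be a bytes object rather than a str, none of this is needed and a single decode('utf-8') should be enough.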

Upvotes: 1

Javier

Reputation: 2776

J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n is not unicode. It may be UTF-8 though.

To turn them into a Unicode string you have to decode them, for example s.decode('utf-8') if s is a bytes object holding UTF-8.

Before printing or writing you have to encode them again. If you encode to ASCII, the encode method accepts an option that tells it what to do with code points that cannot be represented in the given encoding.

For example: print(s.encode('ascii', errors='ignore'))

errors accepts other values too, such as 'replace' and 'backslashreplace'.
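Note that errors='ignore' on its own simply drops the accented letters (ö disappears rather than becoming o). If the goal is the nearest plain letter, one common approach, which is not part of the answer above, is to decompose the characters with unicodedata.normalize first and then drop the leftover combining marks. A minimal sketch, with to_ascii as a made-up helper name:

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate text with plain ASCII letters (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus a combining
    # mark, e.g. 'ö' becomes 'o' followed by U+0308.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' drops the combining marks
    # (and anything else non-ASCII); decode turns the bytes back into str.
    return decomposed.encode('ascii', errors='ignore').decode('ascii')


plain = to_ascii('J\u00f6kuls\u00e1rl\u00f3n')   # 'Jokulsarlon'
dropped = 'J\u00f6kuls\u00e1rl\u00f3n'.encode('ascii', errors='ignore').decode('ascii')  # 'Jkulsrln'
```

This only works for characters that have a decomposition; letters such as ø or ß have no base-letter form and are still dropped.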

Upvotes: 1

Simeon Visser

Reputation: 122336

It's indeed UTF-8 but they're bytes:

>>> b = b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b
b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b.decode('utf-8')
'Jökulsárlón'

As this is Python 3.x, the decoded result is a Unicode string (str).
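In a scraping script the title may arrive as bytes in some places and as str in others, so a small guard can help. The sketch below is an assumption about how one might wrap the decode call; to_text is a made-up name, and UTF-8 is assumed unless the page says otherwise.

```python
def to_text(data, encoding='utf-8'):
    """Return data as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(data, bytes):
        # errors='replace' substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        return data.decode(encoding, errors='replace')
    # Already a str: nothing to decode.
    return data


title = to_text(b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n')
```

Once the value is a str, print(title) shows the actual characters as long as the terminal's encoding can represent them.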

Upvotes: 2
