Reputation: 1083
Let's say I have a unicode variable:
uni_var = u'Na teatr w pi\xc4\x85tek'
I want to have a string that is the same as uni_var, just without the "u", so:
str_var = 'Na teatr w pi\xc4\x85tek'
How can I do it? I would like to find something like:
str_var = uni_var.text()
Upvotes: 0
Views: 1071
Reputation: 1123420
You appear to have incorrectly decoded text; those are UTF-8 bytes masquerading as Latin-1 code points.
You can get back to the proper UTF-8 bytes by encoding with a codec that maps Unicode code points one-to-one to bytes, like Latin-1:
>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> uni_var.encode('latin1')
'Na teatr w pi\xc4\x85tek'
Be careful, though: the text may instead have been decoded as CP1252 rather than Latin-1. It all depends on where this Mojibake was produced.
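If you don't know which of the two codecs produced the damage, one option is to try Latin-1 first and fall back to CP1252. This is only a rough sketch: `repair_mojibake` is a hypothetical helper name, not a library function, and the CP1252 sample string is my own illustration of what the same text looks like after a CP1252 mis-decode (byte 0x85 becomes U+2026 there).

```python
# -*- coding: utf-8 -*-

def repair_mojibake(text, codecs=("latin-1", "cp1252")):
    """Hypothetical helper: undo UTF-8 bytes that were decoded with the wrong codec.

    Tries each candidate codec in turn and returns the first result that
    round-trips to valid UTF-8. Returns the input unchanged if none does.
    """
    for codec in codecs:
        try:
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# UTF-8 bytes mis-decoded as Latin-1 (the string from the question).
latin1_damaged = u'Na teatr w pi\xc4\x85tek'
# The same bytes mis-decoded as CP1252: 0x85 maps to U+2026 instead of U+0085.
cp1252_damaged = u'Na teatr w pi\xc4\u2026tek'

fixed_from_latin1 = repair_mojibake(latin1_damaged)
fixed_from_cp1252 = repair_mojibake(cp1252_damaged)
```

Both calls should give back the same repaired Unicode string. Text that was never damaged can, in rare cases, also survive the round trip and be changed, so only apply something like this to input you already know is Mojibake.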
You could also use the ftfy library to detect how best to repair this; it produces Unicode output:
>>> import ftfy
>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> ftfy.fix_text(uni_var)
u'Na teatr w pi\u0105tek'
>>> print ftfy.fix_text(uni_var)
Na teatr w piątek
The library handles CP1252 Mojibake automatically.
Upvotes: 2
Reputation: 61253
You need to encode your string to Latin-1:
>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> uni_var.encode('Latin-1')
'Na teatr w pi\xc4\x85tek'
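To see why this works, here is a small check (the variable names `str_var` and `fixed` are just for illustration). Latin-1 turns every code point below U+0100 into the single byte with the same value, so the escapes in the original literal come back out as raw bytes, and those bytes happen to be valid UTF-8.

```python
# -*- coding: utf-8 -*-

uni_var = u'Na teatr w pi\xc4\x85tek'
str_var = uni_var.encode('latin-1')

# Each code point below U+0100 becomes the single byte with the same value.
byte_values = list(bytearray(str_var))
code_points = [ord(ch) for ch in uni_var]

# The recovered bytes are valid UTF-8 and decode to the intended text.
fixed = str_var.decode('utf-8')
```

Here `byte_values` and `code_points` should be identical lists, and `fixed` should be the properly decoded Unicode string.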
Upvotes: 1