python unicode get value / get text

Question

Let's say I have a unicode variable:

uni_var = u'Na teatr w pi\xc4\x85tek'

I want to have a string, which will be the same as uni_var, just without the "u", so:

str_var = 'Na teatr w pi\xc4\x85tek'

How can I do it? I would like to find something like:

str_var = uni_var.text()

Martijn Pieters · Accepted Answer

You appear to have badly decoded Unicode; those are UTF-8 bytes masquerading as Latin-1 codepoints.
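To see how a value like this comes about, here is a minimal sketch. The variable names are illustrative and not from the question. It encodes the intended text as UTF-8 and then decodes those bytes with the wrong codec, Latin-1:

```python
# -*- coding: utf-8 -*-
# Illustrative names only; this reproduces the value from the question.
correct = u'Na teatr w pi\u0105tek'      # intended text, containing U+0105
utf8_bytes = correct.encode('utf-8')     # U+0105 becomes the two bytes C4 85
mojibake = utf8_bytes.decode('latin1')   # wrong codec: one codepoint per byte

# The badly decoded string is exactly the uni_var from the question.
assert mojibake == u'Na teatr w pi\xc4\x85tek'
```

Each UTF-8 byte ends up as its own codepoint, which is why the single character U+0105 shows up as the pair `\xc4\x85`.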

You can get back to proper UTF-8 bytes by encoding to a codec that maps Unicode codepoints one-to-one to bytes, like Latin-1:

>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> uni_var.encode('latin1')
'Na teatr w pi\xc4\x85tek'

but be careful; it could also be that the CP1252 codec was used for the bad decode here, in which case you need to encode with 'cp1252' rather than 'latin1'. It all depends on where this Mojibake was produced.
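The difference between the two codecs shows up on bytes such as `\x85`: Latin-1 maps it to U+0085, while CP1252 maps it to U+2026 (the ellipsis), which Latin-1 cannot encode. A rough sketch of a fallback follows, assuming the text was originally UTF-8. The helper name `repair_mojibake` is hypothetical and not part of any library:

```python
# -*- coding: utf-8 -*-
def repair_mojibake(text):
    """Try to undo a bad Latin-1 or CP1252 decode of UTF-8 bytes.

    Hypothetical helper for illustration: returns the input unchanged
    when neither codec round-trips to valid UTF-8.
    """
    for codec in ('latin1', 'cp1252'):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


raw = u'pi\u0105tek'.encode('utf-8')
via_latin1 = raw.decode('latin1')   # contains U+0085
via_cp1252 = raw.decode('cp1252')   # contains U+2026 instead

fixed_latin1 = repair_mojibake(via_latin1)
fixed_cp1252 = repair_mojibake(via_cp1252)
```

Both calls should recover the same text. This only tries the two codecs mentioned above and is no substitute for knowing where the data was mis-decoded.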

You could also use the ftfy library to detect how to best repair this; it produces Unicode output:

>>> import ftfy
>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> ftfy.fix_text(uni_var)
u'Na teatr w pi\u0105tek'
>>> print ftfy.fix_text(uni_var)
Na teatr w piątek

The library will handle CP1252 Mojibake automatically.
