David542

Reputation: 110083

Removing lxml control characters

I have the following string:

s = '''L\'eredit\xc3\xa0 della leggenda del ballo Honey Daniels continua a vivere.\nDopo un periodo passato
in riformatorio Maria cerca di ricostruire la propria vita con nient\'altro che il suo talento per la street dance e un desiderio
bruciante di mettersi alla prova. Maria si getta anima
e corpo nella danza e accetta di allenare un gruppo di giovani inesperti (gli "HD"), per partecipare alla competizione t
elevisiva Dance Battle Zone. Come Honey prima di lei, la giovane riscoprir\xc3\xa0 se stessa e capir\xc3\xa0 cosa 
vuole veramente nella vita attraverso l\'emo
zione della danza. \xc2\xa9 2010 Universal Studios Home Entertainment Productions LLC. All Rights Reserved.'''

I am trying to write an xml node with it as follows:

>>> from lxml import etree
>>> x = etree.Element('Items')
>>> x.text = s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

There is a similar question, Filtering out certain bytes in python, but I wasn't able to use it to solve this issue. How would I fix the above?

Upvotes: 1

Views: 868

Answers (1)

jcomeau_ictx

Reputation: 38422

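Since your text is already UTF-8 encoded, it is a byte string containing non-ASCII bytes. As the error message says, lxml accepts only Unicode or plain ASCII strings, so you need to decode it to Unicode before assigning it: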

x.text = s.decode('utf8')
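
If the input may also contain real control characters (the other case the error message names), decoding alone will not be enough. Below is a minimal sketch, not part of lxml: to_xml_text is a hypothetical helper name, and the character class is my reading of the XML 1.0 "Char" production. It assumes Python 3, where the byte string from the question would be written with a b prefix.

```python
import re

# Assumption: anything outside the XML 1.0 "Char" production is unwanted.
# Tab, line feed and carriage return are kept; other control characters,
# surrogates, U+FFFE and U+FFFF are removed.
_INVALID_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def to_xml_text(raw, encoding='utf-8'):
    """Hypothetical helper: decode bytes and drop XML-incompatible characters."""
    if isinstance(raw, bytes):
        # Same step as the answer above: UTF-8 bytes -> Unicode text.
        raw = raw.decode(encoding)
    return _INVALID_XML_CHARS.sub(u'', raw)


# Shortened stand-in for the string in the question.
sample = b"L'eredit\xc3\xa0 della leggenda\n\xc2\xa9 2010"
clean = to_xml_text(sample)
```

The result can then be assigned with x.text = to_xml_text(s). Silently dropping characters is a design choice; replacing them with a visible marker may be safer if you need to notice bad input.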

Upvotes: 3
