Reputation: 57670
I have a string like the following.
<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&Y Ltd. & M.K. Ltd will be merged.
How can I make it valid XML so that my etree.XMLParser does not throw an error? I need to convert it to something like this:
<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&amp;Y Ltd. &amp; M.K. Ltd will be merged.
For this I tried to use tidylib, but it removed all the custom tags. Here is the code:
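import tidylib

# The input string from above; the bare "&" characters make it invalid XML
data = '<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&Y Ltd. & M.K. Ltd will be merged.'
options = {
    'wrap': 0,
    'indent': 0,
    'output-xhtml': 1,
    'numeric-entities': 1
}
html, warnings = tidylib.tidy_fragment(data, options)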
The output is:
LUSAKA (AP) -- X&Y Ltd. & M.K. Ltd will be merged.
Upvotes: 0
Views: 151
Reputation: 11396
>>> from lxml import etree
>>> tree = etree.fromstring('<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&Y Ltd. & M.K. Ltd will be merged.', etree.HTMLParser())
>>> etree.tostring(tree)
'<html><body><gpe>LUSAKA</gpe> (<org>AP</org>) -- X&amp;Y Ltd. &amp; M.K. Ltd will be merged.</body></html>'
>>> tree.xpath('//gpe/text()')
['LUSAKA']
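Note that the HTML parser lowercases the tag names and wraps everything in html/body. If the original tag case matters, another option is to escape only the bare ampersands before handing the string to an XML parser. The following is a minimal sketch, not a general XML repair tool: the helper name escape_bare_ampersands and the wrapping <root> element are assumptions made for illustration, and it uses the standard-library xml.etree.ElementTree so that it runs without extra dependencies (lxml's etree.fromstring should accept the same escaped string). It leaves anything that already looks like an entity reference untouched, so HTML-only entities such as &nbsp; would still be rejected by an XML parser.

```python
import re
from xml.etree import ElementTree as ET

# Match "&" that does NOT start something shaped like an entity reference
# (&name;  &#123;  &#x1F;). Only those bare ampersands get escaped.
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')

def escape_bare_ampersands(text):
    """Hypothetical helper: replace bare '&' with '&amp;'."""
    return BARE_AMP.sub('&amp;', text)

data = '<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&Y Ltd. & M.K. Ltd will be merged.'
fixed = escape_bare_ampersands(data)

# The fragment has text outside any element, so wrap it in a single
# root element (an assumption) to make it a well-formed document.
root = ET.fromstring('<root>' + fixed + '</root>')

gpe_text = root.find('GPE').text   # tag case is preserved here
org_text = root.find('ORG').text
```

Because the XML parser is case-sensitive, the elements are looked up as GPE and ORG rather than gpe and org.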
Upvotes: 1