I inherited a Python script, and right now I have a chunk of text in a paragraph variable that contains anchor tags. For example:
This is text with a <a href="http://somewebsite.com">Link</a> in it.
What I need to do, however, is convert the anchor tags to the apxh namespace, so the above line should look something like this:
This is text with a <apxh:a href="http://somewebsite.com">Link</apxh:a> in it.
The problem is that the script as it stands outputs:
This is text with a <apxh:a href=\"http://somewebsite.com;\">Link Text;</apxh:a> in it.
My guess is that when I run the for loop over the paragraph, I need to somehow find all anchor tags and their text and do something like etree.Element("{%s}a" % nm["apxh"], nsmap=nm), but I'm not really sure how.
This is the current script:
def get_news_feed(request):
    articles = models.Article.objects.all().filter(distributable = True)
    nm = {
        None: "http://www.w3.org/2005/Atom",
        "ap": "http://ap.org/schemas/03/2005/aptypes",
        "apcm": "http://ap.org/schemas/03/2005/apcm",
        "apnm": "http://ap.org/schemas/03/2005/apnm",
        "apxh": "http://www.w3.org/1999/xhtml",
    }
    doc = etree.Element("{%s}feed" % nm[None], nsmap=nm)
    for article in articles:
        entry = etree.Element("{%s}entry" % nm[None], nsmap=nm)
        content = etree.Element("{%s}content" % nm[None], nsmap=nm)
        content.set("type", "xhtml")
        div = etree.Element("{%s}div" % nm["apxh"], nsmap=nm)
        for paragraph in article.body.replace("&", "&").split("\n"):
            par = etree.Element("{%s}p" % nm["apxh"], nsmap=nm)
            par.text = paragraph
            par.text = paragraph.replace("<a", "<apxh:a")
            par.text = par.text.replace("</a", "</apxh:a")
            par.text = cleanup_entities(par.text)
            div.append(par)
        content.append(div)
        entry.append(content)
        doc.append(entry)
    output = etree.tostring(doc, encoding="UTF-8", xml_declaration=True, pretty_print=True)
    return HttpResponse(output, mimetype="application/xhtml+xml")
This is how the output should look:
<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:ap="http://ap.org/schemas/03/2005/aptypes" xmlns:apxh="http://www.w3.org/1999/xhtml" xmlns:apnm="http://ap.org/schemas/03/2005/apnm" xmlns:apcm="http://ap.org/schemas/03/2005/apcm" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <apxh:div>
        <apxh:p>This is some text</apxh:p>
        <apxh:p>This is text with a <apxh:a href="http://somewebsite.com">Link</apxh:a> in it.</apxh:p>
        <apxh:p>Theater</apxh:p>
      </apxh:div>
    </content>
  </entry>
</feed>
This is how the output currently looks:
<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:ap="http://ap.org/schemas/03/2005/aptypes" xmlns:apxh="http://www.w3.org/1999/xhtml" xmlns:apnm="http://ap.org/schemas/03/2005/apnm" xmlns:apcm="http://ap.org/schemas/03/2005/apcm" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <apxh:div>
        <apxh:p>This is some text</apxh:p>
        <apxh:p>This is text with a <apxh:a href=\"http://somewebsite.com;\">Link Text;</apxh:a> in it.</apxh:p>
        <apxh:p>Theater</apxh:p>
      </apxh:div>
    </content>
  </entry>
</feed>
Reputation: 295678
Don't inject your content as literal text. Instead, parse it into DOM content, using a namespace map that makes the default namespace the same one that is mapped to apxh:
import lxml.etree as etree
text='This is text with a <a href="http://somewebsite.com">Link</a> in it.'
text_el = etree.fromstring('<root xmlns="http://www.w3.org/1999/xhtml">' + text + '</root>')
...then put the contents of text_el inside your par.
Doing that might look like the following:
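# par is the apxh:p element built inside the question's paragraph loop
par = etree.Element('{http://www.w3.org/1999/xhtml}p', nsmap=nm)
par.text = text_el.text
# iterate over a copy, since append() moves each child out of text_el
for child_el in text_el[:]:
    par.append(child_el)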
Because the nsmap nm is used above, converting this back to a string will honor the namespace prefixes contained therein, thus using apxh for content left in the default namespace (which we mapped with xmlns= inside the artificial root).
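Putting those pieces together, a minimal sketch of the per-paragraph step might look like the following. The names XHTML_NS and paragraph_to_element, and the trimmed-down nm, are assumptions made for illustration and are not part of the original script; only etree.fromstring, etree.Element and etree.tostring come from lxml itself.

```python
import lxml.etree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed-down copy of the question's namespace map (assumption: only the
# two entries needed for this demonstration are kept).
nm = {
    None: "http://www.w3.org/2005/Atom",
    "apxh": XHTML_NS,
}


def paragraph_to_element(text, nsmap):
    """Parse an XHTML fragment and return it wrapped in an XHTML p element."""
    # Wrap the fragment in an artificial root whose default namespace is
    # XHTML, so that the bare <a> tags land in that namespace.
    wrapper = etree.fromstring('<root xmlns="%s">%s</root>' % (XHTML_NS, text))
    par = etree.Element("{%s}p" % XHTML_NS, nsmap=nsmap)
    par.text = wrapper.text
    # Iterate over a copy: append() moves each child out of the wrapper.
    for child in wrapper[:]:
        par.append(child)
    return par


par = paragraph_to_element(
    'This is text with a <a href="http://somewebsite.com">Link</a> in it.', nm)
serialized = etree.tostring(par, encoding="unicode")
print(serialized)
```

In the question's loop, each element returned this way could be appended to div in place of assigning the raw string to par.text.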
In discussion in comments, it's come up that some of your production data looks like:
u'John Doe: 360-555-4546; <a href=\\"mailto:[email protected];\\">John.mailto:[email protected]</a> twitter.com/JohnDoe'
etree.fromstring() will throw an exception when given this input, because it isn't valid XML (or valid XHTML), on account of the backslashes.
If you're quite sure that \" won't ever occur in valid input, you might consider stripping the backslashes before parsing:
text_el = etree.fromstring(
    '<root xmlns="http://www.w3.org/1999/xhtml">' +
    text.replace('\\"', '"') +
    '</root>')
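Since other malformed input may still turn up, one option is to wrap the cleanup and the parse in a small helper that catches the parser error. This is only a sketch: parse_fragment is a hypothetical name, the sample string is made up, and returning None on failure is just one possible policy (falling back to plain text would be another).

```python
import lxml.etree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"


def parse_fragment(text):
    """Return the wrapped fragment as an element, or None if it does not parse."""
    # Turn backslash-escaped quotes (\") back into plain quotes.
    cleaned = text.replace('\\"', '"')
    try:
        return etree.fromstring(
            '<root xmlns="%s">%s</root>' % (XHTML_NS, cleaned))
    except etree.XMLSyntaxError:
        # Still not well-formed XML; let the caller decide what to do.
        return None


# Made-up sample containing literal backslashes before the quotes.
raw = 'See <a href=\\"http://somewebsite.com\\">Link</a> here.'
root = parse_fragment(raw)
```

The caller can then check for None and, for example, skip the paragraph or log it for manual repair.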