How Can I Output Non-Escaped Element Tag In XML?

Question

I have a Python script that I have inherited and my issue is that right now I have a chunk of text in a paragraph variable that contains anchor tags. For example:

This is text with a Link in it.

What I'm required to do however is convert the anchor tags to the apxh name space so the above line should look something like this:

This is text with a Link in it.

The problem is the way I have it above is outputting:

This is text with a <apxh:a href="http://somewebsite.com;">Link Text;</apxh:a> in it.

My guess is that when I'm running the for loop on the paragraph, I need to some how find all anchor tags and text and do something like etree.Element("{%s}a" % nm["apxh"], nsmap=nm) but I'm not really sure

This is the current script:

def get_news_feed(request):
    articles = models.Article.objects.all().filter(distributable = True)

    nm = {
            None: "http://www.w3.org/2005/Atom",
            "ap": "http://ap.org/schemas/03/2005/aptypes",
            "apcm": "http://ap.org/schemas/03/2005/apcm",
            "apnm": "http://ap.org/schemas/03/2005/apnm",
            "apxh": "http://www.w3.org/1999/xhtml",
            }

    doc = etree.Element("{%s}feed" % nm[None], nsmap=nm)

    for article in articles:
        entry = etree.Element("{%s}entry" % nm[None], nsmap=nm)
        content = etree.Element("{%s}content" % nm[None], nsmap=nm)
        content.set("type", "xhtml")

        div = etree.Element("{%s}div" % nm["apxh"], nsmap=nm)
        for paragraph in article.body.replace("&", "&").split("
"):
            par = etree.Element("{%s}p" % nm["apxh"], nsmap=nm)
            par.text = paragraph            
            par.text = paragraph.replace("



This is how the output should look:



  
    
      
        This is some text
        This is text with a Link in it.
        Theater
      
    
  



This is how the output currently looks:



  
    
      
        This is some text
        This is text with a <apxh:a href="http://somewebsite.com;">Link Text;</apxh:a> in it.
        Theater

Charles Duffy · Accepted Answer

Don't inject your content as literal text -- render it into DOM content, with a namespace map that implicitly makes the default namespace the same one mapped to aphx:

import lxml.etree as etree
text='This is text with a Link in it.'
text_el = etree.fromstring('' + text + '')

...then put the contents of text_el inside your par.

Doing that might look like the following:

par = etree.Element('{http://www.w3.org/1999/xhtml}div', nsmap=nm)
par.text = text_el.text
for child_el in text_el[:]:
  par.append(child_el)

Because the nsmap nm is used above, converting this back to a string will honor the namespace prefixes contained therein, thus using apxh for content left in the default namespace (which we mapped with xmlns= inside the artificial root).

In discussion in comments, it's come up that some of your production data looks like:

u'John Doe: 360-555-4546; John.mailto:john.doe@website.com twitter.com/JohnDoe'

etree.fromstring() will throw an exception when given this input, because it isn't valid XML (or valid XHTML), on account of the backslashes.

If you're quite sure that " won't ever occur in valid input, you might consider:

text_el = etree.fromstring(
  '' +
  text.replace('\"', '"') +
  '')

How Can I Output Non-Escaped Element Tag In XML?

Answers (1)

Related Questions