Reputation: 4332
So i have a need to process some HTML in Python, and my requirement is that i need to find a certain tag and replace it with different charecter based on the content of the charecters...
<html>
<Head>
</HEAD>
<body>
<blah>
<_translate attr="french"> I am no one,
and no where <_translate>
<Blah/>
</body>
</html>
Should become
<html>
<Head>
</HEAD>
<body>
<blah>
Je suis personne et je suis nulle part
<Blah/>
</body>
</html>
I would like to leave the original HTML untouched an only replace the tags labeled 'important-tag'. Attributes and the contents of that tag will be important to generate the tags output.
I had though about using extending HTMLParser Object but I am having trouble getting out the orginal HTML when i want it. I think what i most want is to parse the HTML into tokens, with the orginal text in each token so i can output my desired output ... i.e. get somthing like
(tag, "<html>")
(data, "\n ")
(tag, "<head>")
(data, "\n ")
(end-tag,"</HEAD>")
ect...
ect...
Anyone know of a good pythonic way to accomplish this ? Python 2.7 standard libs are prefered, third party would also be useful to consider...
Thanks!
Upvotes: 0
Views: 72
Reputation: 3857
You can use lxml to perform such a task http://lxml.de/tutorial.html and use XPath to navigate easily trough your html:
from lxml.html import fromstring
my_html = "HTML CONTENT"
root = fromstring(my_html)
nodes_to_process = root.xpath("//_translate")
for node in nodes_to_process:
lang = node.attrib["attr"]
translate = AWESOME_TRANSLATE(node.text, lang)
node.parent.text = translate
I'll leave up to you the implementation of the AWESOME_TRANSLATE function ;)
Upvotes: 2