HTML Parsing in Python

Question

I need to process some HTML in Python. My requirement is to find a certain tag and replace it with different characters based on the content of that tag. For example:


   
   
   
     
       <_translate attr="french"> I am no one, 
           and no where <_translate>

Should become


   
   
   
     
       Je suis personne et je suis nulle part

I would like to leave the original HTML untouched and only replace the tags labeled 'important-tag'. The attributes and contents of that tag are needed to generate the tag's output.

I had thought about extending the HTMLParser object, but I am having trouble getting the original HTML back out when I want it. I think what I most want is to parse the HTML into tokens, with the original text kept in each token, so I can produce my desired output, i.e. get something like:

(tag, "")
(data, "
    ")
(tag, "")
(data, "
    ")
(end-tag,"")
etc.
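
One way to get tokens like these might be a small regex-based tokenizer. The following is only a minimal sketch, not a full HTML parser: the names `TAG_RE` and `tokenize` are made up for illustration, and it assumes that no `>` appears inside attribute values and that comments and script bodies do not need special handling. It uses only the standard-library `re` module, and every token keeps its original text, so joining the tokens gives back the input unchanged.

```python
import re

# Naive tag pattern: "<", an optional "/", a tag name, then everything up to
# the next ">". Assumption: attribute values never contain ">".
# The name may start with "_" so that tags like <_translate> are matched.
TAG_RE = re.compile(r'<(/?)([A-Za-z_][\w:-]*)([^>]*)>')


def tokenize(html):
    """Yield (kind, original_text) pairs covering the whole input string."""
    pos = 0
    for match in TAG_RE.finditer(html):
        if match.start() > pos:
            # Text between the previous tag and this one, kept verbatim.
            yield ('data', html[pos:match.start()])
        kind = 'end-tag' if match.group(1) else 'tag'
        yield (kind, match.group(0))
        pos = match.end()
    if pos < len(html):
        # Trailing text after the last tag.
        yield ('data', html[pos:])


sample = '<p>Hi <b>there</b></p>'
for token in tokenize(sample):
    print(token)
```

Because each token carries its original text, untouched tokens can be written straight back out, and only the tokens belonging to the tag of interest need to be replaced.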

Does anyone know of a good Pythonic way to accomplish this? Python 2.7 standard libraries are preferred, but third-party libraries would also be worth considering.

Thanks!

Answers (1)
