Parsing and editing HTML files using Python

Question

The issue is as follows: I have a basic auto-generated HTML file, a dump from an object database. The information is table-based. The structure of the file is the same for each generation, and the content is generally coherent. I have to process this file further (add remarks, etc.), so I want to edit the HTML slightly: for example, add an extra table cell with a writable text field for remarks, and maybe a final button to generate some additional output. Now the questions:

I chose to write a Python script to handle these changes to the file. Is this the right choice, or can you suggest something better?

For now I'm dealing with it as follows:

1) Make a working copy of the base file

2) Open the working copy and read it into a string in Python:

content = content_file.read()

3) Feed this string to an instance of my html.parser.HTMLParser subclass:

parser = ModifyHtmlParser()  # feed() must be called on an instance, not on the class
parser.feed(content)

4) Using overridden methods of the HTMLParser base class, I search for the tags I'm interested in:

def handle_starttag(self, tag, attrs):
    #print("Encountered a start tag:", tag)
    if tag == "tr":
        print("Table row start!")
        offset = self.getpos()
        tagText = self.get_starttag_text()

As a result I only get an immutable view of the input plus the positions of the tags I marked, and I feel like I'm heading into a dead end. Any ideas on how I should rework my approach? Could any particular library be useful here?
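
To make the goal concrete, here is a minimal sketch of one way the step-4 parser could be reworked so that it produces output instead of only reporting positions: every callback re-emits what it saw, and the end-tag handler adds an extra cell just before each closing `</tr>`. It uses only the standard-library `html.parser` module. The names `RemarkCellAdder`, `add_remark_cells` and the markup in `EXTRA_CELL` are hypothetical, chosen for illustration, and the sketch assumes the dump is reasonably well-formed HTML with explicit `</tr>` tags.

```python
from html.parser import HTMLParser


class RemarkCellAdder(HTMLParser):
    """Re-emit the input as parsed, adding one extra cell to every table row."""

    # Hypothetical markup for the remark cell; adjust to the real table layout.
    EXTRA_CELL = '<td><input type="text" name="remark"></td>'

    def __init__(self):
        # convert_charrefs=False routes entities such as &amp; to the
        # handle_entityref/handle_charref callbacks so they can be copied as-is.
        super().__init__(convert_charrefs=False)
        self.out = []  # collected output fragments

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "tr":
            # Insert the new cell just before the row is closed.
            self.out.append(self.EXTRA_CELL)
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def add_remark_cells(html_text):
    """Return html_text with EXTRA_CELL appended to every table row."""
    parser = RemarkCellAdder()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

With the `content` string from step 2, `add_remark_cells(content)` returns the modified document as a new string, which can then be written back to the working copy. End tags come out lowercased because the parser normalizes tag names. For heavier edits, a tree-based parser is probably more comfortable than this event-based one.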

