Tomas

Reputation: 3446

Parsing and editing HTML files using Python

The issue is as follows: I have a basic, auto-generated HTML file that is a dump from an object database. It contains table-based information. The structure of the file is the same for each generation, and the content is generally coherent. I have to process this file further (add remarks, etc.), so I would like to edit the HTML a bit: for example, add an extra table cell with a writable text field for remarks, and maybe a final button to generate some additional output. Now the questions:

I chose to write a Python script to handle these changes to the file. Is this the right choice, or can you suggest something better?

For now I'm dealing with it as follows:

1) Make a working copy of the base file

2) Read the working copy into a string in Python:

content = content_file.read()

3) Run this through an html.parser object:

ModifyHtmlParser.feed(content)

4) Using overridden base-class methods of the HTML parser, I search for the interesting tags:

def handle_starttag(self, tag, attrs):
    #print("Encountered a start tag:", tag)
    if tag == "tr":
        print("Table row start!")
        offset = self.getpos()
        tagText = self.get_starttag_text()

As a result I'm getting an immutable subset of the input with the tags marked, and for now I feel like I'm heading into a dead end... Any ideas on how I should rework my approach? Could any particular library be useful?

Upvotes: 0

Views: 4395

Answers (1)

Ming

Reputation: 1693

I would recommend you use the following general approach.

  1. Load and parse the HTML into a convenient in-memory tree representation using any of the existing libraries for such tasks.
  2. Find the relevant nodes in the tree. (Most libraries from step 1 provide some form of XPath and/or CSS selectors. Both let you find all nodes that satisfy a particular rule. In your case, the rule is probably "tr which ...".)
  3. Process the found nodes individually (most libraries from step 1 let you edit the tree in place).
  4. Write out either the modified tree or a newly generated tree.
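
To make the four steps concrete, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree`. That module is my stand-in here, not a real HTML parser: it only works if the generated markup happens to be well-formed XML, which is an assumption about your dump. The sample table, the function name `add_remark_cells`, and the `remark_N` field names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the generated dump; assumed to be well-formed.
SAMPLE = "<table><tr><td>alpha</td></tr><tr><td>beta</td></tr></table>"


def add_remark_cells(markup):
    """Append a cell holding a text input to every table row."""
    # Step 1: parse the markup into an in-memory tree.
    root = ET.fromstring(markup)
    # Step 2: find the relevant nodes (here: every <tr> below the root).
    rows = root.findall(".//tr")
    # Step 3: edit the tree in place, one row at a time.
    for n, row in enumerate(rows, start=1):
        cell = ET.SubElement(row, "td")
        ET.SubElement(cell, "input", {"type": "text", "name": f"remark_{n}"})
    # Step 4: write the modified tree back out as HTML text.
    return ET.tostring(root, encoding="unicode", method="html")


if __name__ == "__main__":
    print(add_remark_cells(SAMPLE))
```

Unlike the `handle_starttag` callbacks in your question, the tree is mutable, so adding a cell is an ordinary `SubElement` call rather than offset bookkeeping on the original string.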

Here is one particular example of how you could implement the above. (The exact choice of libraries is somewhat flexible; you have multiple options here.)

  1. There are multiple options for an HTML parsing and representation library. The most common recommendation I hear these days is lxml.
  2. lxml provides both CSS selector support and XPath support.
  3. For editing the tree in place, see the lxml etree documentation.
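
As a rough sketch of how this could look with lxml (a third-party package, so this assumes `pip install lxml`): the sample table, the function name `add_remarks_column`, the "Remarks" header text, and the `remark_N` field names are invented for illustration and will need adapting to your actual dump.

```python
import lxml.html
from lxml import etree

# Hypothetical stand-in for the generated dump.
SAMPLE = """<html><body>
<table>
<tr><th>Name</th><th>Value</th></tr>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</table>
</body></html>"""


def add_remarks_column(html_text):
    """Add a 'Remarks' column with a text input to every data row."""
    # Parse into a mutable tree; lxml.html tolerates typical HTML quirks.
    root = lxml.html.fromstring(html_text)
    # XPath query: every table row anywhere in the document.
    for index, row in enumerate(root.xpath("//tr")):
        if row.xpath("./th"):
            # Header row: add a matching header cell.
            header = etree.SubElement(row, "th")
            header.text = "Remarks"
        else:
            # Data row: add a cell containing a writable text field.
            cell = etree.SubElement(row, "td")
            etree.SubElement(
                cell, "input", {"type": "text", "name": f"remark_{index}"}
            )
    # Serialize the modified tree back to an HTML string.
    return lxml.html.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_remarks_column(SAMPLE))
```

I used XPath here because it works out of the box. If you prefer CSS selectors, `root.cssselect("tr")` should do the same job, but it needs the separate `cssselect` package to be installed.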

Upvotes: 1
