Reputation: 283
When I parse HTML I wish to obtain only the innermost tags for the entire document. My intention is to semantically parse data from the HTML doc.
So if I have some html like this
<html>
<table>
<tr><td>X</td></tr>
<tr><td>Y</td></tr>
</table>
</html>
I want <td>X</td>
and <td>Y</td>
alone. Is this possible using Beautiful Soup or lxml?
Upvotes: 1
Views: 759
Reputation: 4313
If you can use DOM handling (i.e. in a browser), you can walk the parentNode attribute for every tag, recursively count the ancestors, and keep the element with the largest count.
In JavaScript (tested in Firefox):
// Count ancestors by walking parentNode up to the document root
function recursiveCountParentNodeOn(element) {
    if (!element.parentNode) {
        return 0;
    }
    return 1 + recursiveCountParentNodeOn(element.parentNode);
}

var allElements = document.getElementsByTagName("*");
var maxElementReference = null, maxParentNodeCount = 0;
var i;
for (i = 0; i < allElements.length; i++) {
    var count = recursiveCountParentNodeOn(allElements[i]);
    if (maxParentNodeCount < count) {
        // This element is nested more deeply than anything seen so far
        maxElementReference = allElements[i];
        maxParentNodeCount = count;
    }
}
Upvotes: 0
Reputation: 338406
After you've made sure your document is well-formed (by parsing it with lxml, for example), you could use XPath to query for all nodes that have no further child elements.
//*[count(*) = 0]
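As a minimal sketch with lxml in Python (the markup is taken from the question; the variable names and the unicode/strip output handling are my own assumptions):
from lxml import html

doc = html.fromstring("""
<html>
<table>
<tr><td>X</td></tr>
<tr><td>Y</td></tr>
</table>
</html>
""")

# Select every element that has no child elements, i.e. the innermost tags
for leaf in doc.xpath("//*[count(*) = 0]"):
    print(html.tostring(leaf, encoding="unicode").strip())
This prints <td>X</td> and <td>Y</td>, which is exactly the set of leaf elements the question asks for.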
Upvotes: 2
Reputation: 60276
That's one of the few situations where you could actually use a Regular Expression to parse the HTML string.
\<(\w+)[^>]*>[^\<]*\</\1\s*>
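For instance, applied in Python (a rough sketch; the sample string is assumed, and the backslash escapes before < in the pattern above are unnecessary in Python's re module):
import re

html_string = "<html><table><tr><td>X</td></tr><tr><td>Y</td></tr></table></html>"

# An opening tag, content with no further '<', then the matching closing tag
pattern = re.compile(r"<(\w+)[^>]*>[^<]*</\1\s*>")
for match in pattern.finditer(html_string):
    print(match.group(0))  # prints <td>X</td>, then <td>Y</td>
Keep in mind this only works for simple, well-behaved markup like the example; nested or malformed tags will break it.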
Upvotes: 0
Reputation: 464
In .NET I've used the HtmlAgilityPack library to make all HTML parsing easy. It loads the DOM and lets you select nodes; in your case, select the nodes with no children. Maybe that helps.
Upvotes: 3