Reputation: 283
When I parse HTML I wish to obtain only the innermost tags for the entire document. My intention is to semantically parse data from the HTML doc.
So if I have some html like this
<html>
<table>
<tr><td>X</td></tr>
<tr><td>Y</td></tr>
</table>
</html>
I want <td>X</td>
and <td>Y</td>
alone. Is this possible using Beautiful Soup or lxml?
Upvotes: 1
Views: 759
Reputation: 4313
If you can use DOM handling (i.e. in a browser), you can walk the parentNode attribute for every tag, recursively count the ancestors, and keep the element with the largest count.
In JavaScript (tested in Firefox):
// Count ancestors by walking parentNode up to the document root
function recursiveCountParentNodeOn(element) {
    if (!element.parentNode) {
        return 0;
    }
    return 1 + recursiveCountParentNodeOn(element.parentNode);
}

var allElements = document.getElementsByTagName("*");
var maxElementReference = null, maxParentNodeCount = 0;
var i;
for (i = 0; i < allElements.length; i++) {
    var count = recursiveCountParentNodeOn(allElements[i]);
    if (maxParentNodeCount < count) {
        // This element is nested more deeply than anything seen so far
        maxElementReference = allElements[i];
        maxParentNodeCount = count;
    }
}
Upvotes: 0
Reputation: 338406
After you've made sure your document is well-formed (by parsing it with lxml, for example), you could use XPath to query for all nodes that have no further child elements.
//*[count(*) = 0]
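As a minimal sketch with lxml in Python (the markup is taken from the question; the variable names and the unicode/strip output handling are my own assumptions):
from lxml import html

doc = html.fromstring("""
<html>
<table>
<tr><td>X</td></tr>
<tr><td>Y</td></tr>
</table>
</html>
""")

# Select every element that has no child elements, i.e. the innermost tags
for leaf in doc.xpath("//*[count(*) = 0]"):
    print(html.tostring(leaf, encoding="unicode").strip())
This prints <td>X</td> and <td>Y</td>, which is exactly the set of leaf elements the question asks for.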
Upvotes: 2
Reputation: 60276
That's one of the few situations where you could actually use a Regular Expression to parse the HTML string.
\<(\w+)[^>]*>[^\<]*\</\1\s*>
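For instance, applied in Python (a rough sketch; the sample string is assumed, and the backslash escapes before < in the pattern above are unnecessary in Python's re module):
import re

html_string = "<html><table><tr><td>X</td></tr><tr><td>Y</td></tr></table></html>"

# An opening tag, content with no further '<', then the matching closing tag
pattern = re.compile(r"<(\w+)[^>]*>[^<]*</\1\s*>")
for match in pattern.finditer(html_string):
    print(match.group(0))  # prints <td>X</td>, then <td>Y</td>
Keep in mind this only works for simple, well-behaved markup like the example; nested or malformed tags will break it.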
Upvotes: 0
Reputation: 464
In .NET I've used the HtmlAgilityPack library to make all HTML parsing easy. It loads the DOM and lets you select nodes; in your case, select the nodes with no children. Maybe that helps.
Upvotes: 3