user1427661

Reputation: 11774

Iterating Over Elements and Sub Elements With lxml

This one is for legitimate lxml gurus. I have a web-scraping application where I want to iterate over a number of div.cont tags (cont is the class) on a website. Once inside a div.cont tag, I want to see if there are any <a> tags that are descendants of <h3> elements. This seems relatively simple: just create a list using XPath from the div.cont tag, i.e.,

linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')

The problem is, I then want to create a tuple that contains the link from the div.cont box as well as the text from the paragraph element of the same div.cont box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but then I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.

lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
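(For what it's worth, the class-name limitation can be worked around by filtering inside the loop yourself: iterate over all div tags with Element.iter() and check the class attribute by hand. A minimal sketch; the inline HTML is just a stand-in for the real page source:

```python
from lxml import html

# inline stand-in for the real page source
PAGE = """
<body>
  <div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
  </div>
  <div class="other">
    <p>Not interesting</p>
  </div>
</body>
"""

doc = html.fromstring(PAGE)

# Element.iter() only filters by tag name, so check the class attribute manually
conts = [div for div in doc.iter('div')
         if 'cont' in (div.get('class') or '').split()]
```

This keeps the pairing problem local to each div, but it is just a manual filter, not a feature of iter() itself.)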

Edit: here's an extremely stripped down version of the HTML I want to parse:

<body>
<div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
</div>
</body>

There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.

Upvotes: 1

Views: 4228

Answers (1)

Martijn Pieters

Reputation: 1122342

You could just use a less specific XPath expression:

for matchingdiv in tree.xpath('div[contains(@class,"cont")]'):
    # skip those without an h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue

    # grab the `p` text and of course the link.
    paragraph = matchingdiv.xpath('.//p/text()')
    href = link[0].get('href')

You could turn this around and select the h3 > a tags directly, then walk back up to the div.cont ancestor:

for matchingdiv in tree.xpath('.//h3//a/ancestor::div[contains(@class,"cont")]'):
    # no need to skip any more; this is a div.cont with an h3 > a inside
    link = matchingdiv.xpath('.//h3//a')

    # grab the `p` text and of course the link
    paragraph = matchingdiv.xpath('.//p/text()')
    href = link[0].get('href')

but since you then need to scan for the link anyway, that doesn't actually buy you anything.
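Putting it together with the stripped-down HTML from the question, here's a complete sketch of the first approach that pairs each div.cont's paragraph text with its h3 > a link (the variable names and the second, link-less div are mine, added to show the skip):

```python
from lxml import html

HTML = """
<body>
<div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
</div>
<div class="cont">
    <h1>No link here</h1>
    <p>Paragraph without a link</p>
</div>
</body>
"""

tree = html.fromstring(HTML)

results = []
for div in tree.xpath('//div[contains(@class, "cont")]'):
    links = div.xpath('.//h3//a')
    if not links:
        continue  # skip div.cont blocks without an h3 > a
    paragraph = div.xpath('.//p/text()')
    results.append((links[0].get('href'), paragraph[0] if paragraph else None))

print(results)  # -> [('somelink', 'The text I want to obtain')]
```

Note the `//div` here makes the expression absolute, so it works no matter where `tree` sits in the document; the relative form in the answer assumes the divs are direct children of the context node.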

Upvotes: 3
