user1427661

Reputation: 11774

Iterating Over Elements and Sub Elements With lxml

This one is for legitimate lxml gurus. I have a web-scraping application where I want to iterate over a number of div.cont tags (cont is the class) on a website. Once inside a div.cont tag, I want to see if there are any <a> tags that are descendants of <h3> elements. This seems relatively simple: just create a list using XPath from the div.cont tag, i.e.,

linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')

The problem is, I then want to create a tuple that contains the link from the div.cont box as well as the text from the paragraph element of the same div.cont box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but then I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.

lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
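(For what it's worth, the class-name limitation can be worked around by filtering inside the loop yourself: iterate over all div tags with Element.iter() and check the class attribute by hand. A minimal sketch; the inline HTML is just a stand-in for the real page source:

```python
from lxml import html

# inline stand-in for the real page source
PAGE = """
<body>
  <div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
  </div>
  <div class="other">
    <p>Not interesting</p>
  </div>
</body>
"""

doc = html.fromstring(PAGE)

# Element.iter() only filters by tag name, so check the class attribute manually
conts = [div for div in doc.iter('div')
         if 'cont' in (div.get('class') or '').split()]
```

This keeps the pairing problem local to each div, but it is just a manual filter, not a feature of iter() itself.)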

Edit: here's an extremely stripped down version of the HTML I want to parse:

<body>
<div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
</div>
</body>

There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.

Upvotes: 1

Views: 4228

Answers (1)

Martijn Pieters

Reputation: 1122342

You could just use a less specific XPath expression:

for matchingdiv in tree.xpath('div[contains(@class,"cont")]'):
    # skip those without an h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue

    # grab the `p` text and of course the link.
    paragraph = matchingdiv.xpath('.//p/text()')
    href = link[0].get('href')

You could turn this around and select the h3 > a tags directly, then walk back up to the div.cont ancestor:

for matchingdiv in tree.xpath('.//h3//a/ancestor::div[contains(@class,"cont")]'):
    # no need to skip any more; this is a div.cont with an h3 > a inside
    link = matchingdiv.xpath('.//h3//a')

    # grab the `p` text and of course the link
    paragraph = matchingdiv.xpath('.//p/text()')
    href = link[0].get('href')

but since you then need to scan for the link anyway, that doesn't actually buy you anything.
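Putting it together with the stripped-down HTML from the question, here's a complete sketch of the first approach that pairs each div.cont's paragraph text with its h3 > a link (the variable names and the second, link-less div are mine, added to show the skip):

```python
from lxml import html

HTML = """
<body>
<div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
</div>
<div class="cont">
    <h1>No link here</h1>
    <p>Paragraph without a link</p>
</div>
</body>
"""

tree = html.fromstring(HTML)

results = []
for div in tree.xpath('//div[contains(@class, "cont")]'):
    links = div.xpath('.//h3//a')
    if not links:
        continue  # skip div.cont blocks without an h3 > a
    paragraph = div.xpath('.//p/text()')
    results.append((links[0].get('href'), paragraph[0] if paragraph else None))

print(results)  # -> [('somelink', 'The text I want to obtain')]
```

Note the `//div` here makes the expression absolute, so it works no matter where `tree` sits in the document; the relative form in the answer assumes the divs are direct children of the context node.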

Upvotes: 3
