Reputation: 11774
This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of `div.cont` (`cont` is the class) tags on a website. Once inside a `div.cont` tag, I want to see if there are any `<a>` tags that are children of `<h3>` elements. This seems relatively simple by just creating a list using XPath from the `div.cont` tag, i.e.,

linkList = tree.xpath('//div[contains(@class,"cont")]//h3//a')

The problem is, I then want to create a tuple that contains the link from the `div.cont` box as well as the text from the paragraph element of the same `div.cont` box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the `<a>` tags.

lxml's `Element.iter()` method could ALMOST achieve this by iterating over all of the `div.cont` elements, ignoring those without `<a>` tags, and pairing up the paragraph/`a` combos, but unfortunately that method only supports iterating over tag names, not class names.
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
  <div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
  </div>
</body>
There are a number of `div.cont` blocks like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
Upvotes: 1
Views: 4228
Reputation: 1122342
You could just use a less specific XPath expression:

results = []
for matchingdiv in tree.xpath('//div[contains(@class,"cont")]'):
    # skip those without a h3 > a setup.
    links = matchingdiv.xpath('.//h3//a')
    if not links:
        continue
    # grab the `p` text and of course the link.
    paragraph = matchingdiv.findtext('.//p')
    results.append((links[0].get('href'), paragraph))
You could expand on this (be ambitious) and select the `h3 > a` tags first, then walk back up to the `div.cont` ancestor with an XPath `ancestor::` query:

for matchingdiv in tree.xpath('//h3//a/ancestor::div[contains(@class,"cont")]'):
    # no need to skip anymore; this is a div.cont with an h3 and a contained link
    links = matchingdiv.xpath('.//h3//a')
    # grab the `p` text and of course the link

but since you then need to scan for the link again anyway, that doesn't actually buy you anything.
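Putting the first approach together, here is a minimal runnable sketch. It uses `lxml.html` and the stripped-down markup from the question (plus a hypothetical second `div.cont` without a link, to show the skip); the variable names are illustrative, not from the original.

```python
from lxml import html

# Sample markup from the question, with an extra link-less div.cont added
# to demonstrate the skip logic.
SAMPLE = """
<body>
  <div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3><a href="somelink">The link I want to obtain</a></h3>
  </div>
  <div class="cont">
    <h1>Another heading</h1>
    <p>No link in this one</p>
  </div>
</body>
"""

tree = html.fromstring(SAMPLE)

pairs = []
for div in tree.xpath('//div[contains(@class,"cont")]'):
    links = div.xpath('.//h3//a')
    if not links:
        continue  # no h3 > a in this div.cont; ignore it
    # Pair the first link's href with the first paragraph's text.
    paragraph = div.findtext('.//p')
    pairs.append((links[0].get('href'), paragraph))

print(pairs)  # [('somelink', 'The text I want to obtain')]
```

Because both the link and the paragraph are looked up relative to the same `div` element, each tuple is guaranteed to pair content from the same `div.cont` box.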
Upvotes: 3