Reputation: 28692
I'm trying to get the HTML of various articles using Scrapy. These articles also include images that I want to process separately.
If I have an article whose HTML looks like this:
<div class="article">
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<img src="/path/to/image.jpg"/>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
</div>
How can I scrape just the non-image HTML, that is, get this:
<div class="article">
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
</div>
So far I've tried:
article = response.xpath("//div[@class='article'][not(img)]").extract()
...but this still includes the images.
Upvotes: 1
Views: 392
Reputation: 111611
XPath is for selection, not transformation or rearrangement.
You can select the div elements that have no img children:
//div[@class='article' and not(img)]
or have no img descendants:
//div[@class='article' and not(.//img)]
Or, you can select the children of the div elements that are p elements:
//div[@class='article']/p
or that are not img elements:
//div[@class='article']/*[not(self::img)]
But you cannot select the requested HTML,
<div class="article">
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
</div>
because that is a rearrangement, not a selection, of markup that exists in the input document.
Upvotes: 1
Reputation: 159
Try the following code:
article = response.xpath("//div[@class='article']//*[not(self::img)]").extract()
Upvotes: 0