YPCrumble

Reputation: 28692

How can I filter images out of HTML in Scrapy with XPath?

I'm trying to get the HTML of various articles using Scrapy. These articles also include images that I want to process separately.

If I have an article whose HTML looks like this:

<div class="article">
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
  <img src="/path/to/image.jpg"/>
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
</div>

How can I scrape just the non-image HTML, like this:

<div class="article">
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
</div>

I've currently tried:

article = response.xpath("//div[@class='article'][not(img)]").extract()

...but this still includes the images.

Upvotes: 1

Views: 392

Answers (2)

kjhughes

Reputation: 111611

XPath is for selection, not transformation or rearrangement.

You can select the div elements that have no img children:

//div[@class='article' and not(img)]

or have no img descendants:

//div[@class='article' and not(.//img)]

Or, you can select the contents of the div elements that are p:

//div[@class='article']/p

or that are not img:

//div[@class='article']/*[not(self::img)]
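To see what each of these expressions selects, here is a minimal sketch that runs them against the question's sample markup. It assumes lxml is available (Scrapy installs it as a dependency); the names SAMPLE and root are made up for illustration, and in a spider the same XPath strings would be passed to response.xpath().

```python
from lxml import html

# The question's sample article, with the class attribute quoted correctly.
SAMPLE = """
<div class="article">
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
  <img src="/path/to/image.jpg"/>
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
</div>
"""

root = html.document_fromstring(SAMPLE)

# The predicate filters the div elements themselves. This div has an img
# child, so both of these select nothing.
no_img_children = root.xpath("//div[@class='article' and not(img)]")
no_img_descendants = root.xpath("//div[@class='article' and not(.//img)]")

# These select the children of the div, skipping the img.
only_p = root.xpath("//div[@class='article']/p")
not_img = root.xpath("//div[@class='article']/*[not(self::img)]")

print(len(no_img_children), len(no_img_descendants))
print([el.tag for el in only_p])
print([el.tag for el in not_img])
```

The last two expressions return the p elements as separate nodes, without the enclosing div.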

But you cannot select the requested HTML,

<div class="article">
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
</div>

because that is a rearrangement, not a selection, of markup that exists in the input document.

Upvotes: 1

NopalByte

Reputation: 159

Try the following code:

article = response.xpath("//div[@class='article']//*[not(self::img)]").extract()

Upvotes: 0
