How can I filter images out of HTML in Scrapy with XPath?

Question

I'm trying to get the HTML of various articles using Scrapy. These articles also include images that I want to process separately.

If I have an article whose HTML looks like this:


  This is a sentence.
  This is a sentence.

How can I scrape just the non-image HTML, or this:



...but this still includes the images.

kjhughes · Accepted Answer

XPath is for selection, not transformation or rearrangement.

You can select the div elements that have no img children:

//div[@class='article' and not(img)]

or have no img descendants:

//div[@class='article' and not(.//img)]

Or you can select the children of those div elements that are p elements:

//div[@class='article']/p

or that are not img:

//div[@class='article']/*[not(self::img)]
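
As a rough illustration of what these expressions select, here is a small sketch that uses lxml directly (the library Scrapy's selectors are built on). The sample markup and the variable names are assumptions made up for this example, since the question's original HTML is not shown here.

```python
from lxml import html

# Hypothetical sample markup, assumed for illustration only.
SAMPLE = """
<html><body>
<div class="article">
  <img src="a.png">
  <p>This is a sentence.</p>
  <p>This is a sentence.</p>
</div>
<div class="article">
  <p>Text only.</p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# div elements with no img children: only the second div qualifies.
divs_without_img = tree.xpath("//div[@class='article' and not(img)]")

# Children of the article divs that are p elements.
paragraphs = tree.xpath("//div[@class='article']/p")

# Children of the article divs that are anything other than img.
non_img_children = tree.xpath("//div[@class='article']/*[not(self::img)]")

print(len(divs_without_img))
print([p.text for p in paragraphs])
print([el.tag for el in non_img_children])
```

In each case the result is a list of nodes that already exist in the document. None of the expressions builds a new div with the images taken out.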

But you cannot select the requested HTML,


  This is a sentence.
  This is a sentence.
  This is a sentence.
  This is a sentence.

because that is a rearrangement, not a selection, of markup that exists in the input document.
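
If the goal is the article markup with the images removed, one possible approach is to do the transformation outside XPath: select the img elements, drop them from the parsed tree, and serialize what is left. The sketch below assumes lxml is available (it is installed alongside Scrapy); the sample markup and the `strip_images` helper are hypothetical names used only for illustration.

```python
from lxml import html

# Hypothetical article markup, assumed for illustration only.
ARTICLE = (
    '<div class="article">'
    '<img src="a.png">'
    '<p>This is a sentence.</p>'
    '<img src="b.png">'
    '<p>This is a sentence.</p>'
    '</div>'
)

def strip_images(fragment: str) -> str:
    """Return the fragment's HTML with every img descendant removed."""
    root = html.fromstring(fragment)
    # xpath() returns a plain list, so it is safe to modify the tree
    # while looping. drop_tree() removes the element but keeps any
    # text that follows it.
    for img in root.xpath(".//img"):
        img.drop_tree()
    return html.tostring(root, encoding="unicode")

cleaned = strip_images(ARTICLE)
print(cleaned)
```

In a spider, the fragment could come from something like `response.xpath("//div[@class='article']").get()`, with the cleaned string stored on the item while the image URLs are collected separately.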
