Geveze

Reputation: 391

Scrapy XPath solution for XML with type="html" and HTML entities

I am scraping an Atom feed (XML). One of the tags looks like this:

<content type="html">
&lt;p&gt; Some text and stuff &lt;/p&gt;
</content>

I also see the same HTML entities for img and a tags. Is there a generic XPath to find the img tag or the p tag, like this:

//content/p  or //content/img/@src

Obviously this does not work, because the markup is escaped as HTML entities. Is there another solution with Scrapy?

Upvotes: 2

Views: 534

Answers (1)

paul trmbrth

Reputation: 20748

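I think you need to extract the text of the content elements and, for each one, parse it as HTML using lxml.html. The XML parser already decodes the entities, so the text you get back is ordinary HTML markup: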

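import lxml.etree
import lxml.html

# xmlfeedstring holds the raw feed (bytes, or str without an encoding declaration)
xmlfeed = lxml.etree.fromstring(xmlfeedstring)
for content in xmlfeed.xpath('//content[@type="html"]/text()'):
    # create_parent wraps the fragment in a <div>, so content holding
    # several top-level elements parses instead of raising ParserError
    htmlcontent = lxml.html.fragment_fromstring(content, create_parent='div')
    paragraphs = htmlcontent.xpath('.//p')
    image_urls = htmlcontent.xpath('.//img/@src')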

See "Parsing HTML fragments" in the lxml documentation.
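
One thing that may trip you up: real Atom feeds usually declare the default namespace http://www.w3.org/2005/Atom, in which case a bare //content matches nothing and the element has to be addressed through a prefix. Here is a minimal self-contained sketch of the same approach with that handled. The sample feed, the "atom" prefix and the extract_html_parts helper are made up for illustration and are not taken from your feed.

```python
import lxml.etree
import lxml.html

# Hypothetical sample feed; the escaped markup mimics the question's <content>.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="html">
      &lt;p&gt;Some text and stuff&lt;/p&gt;
      &lt;p&gt;Logo: &lt;img src="http://example.com/a.png"&gt;&lt;/p&gt;
    </content>
  </entry>
</feed>"""

# The prefix name is arbitrary; only the namespace URI has to match the feed.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def extract_html_parts(feed_bytes):
    """Return (paragraph_texts, image_urls) from all type="html" content elements."""
    feed = lxml.etree.fromstring(feed_bytes)
    paragraph_texts, image_urls = [], []
    for content in feed.xpath('//atom:content[@type="html"]/text()',
                              namespaces=ATOM_NS):
        # The XML parser has already decoded &lt;p&gt; into <p>,
        # so content is an ordinary HTML string at this point.
        fragment = lxml.html.fragment_fromstring(content, create_parent="div")
        paragraph_texts += [p.text_content().strip() for p in fragment.xpath(".//p")]
        image_urls += fragment.xpath(".//img/@src")
    return paragraph_texts, image_urls


paragraph_texts, image_urls = extract_html_parts(SAMPLE_FEED)
print(paragraph_texts)
print(image_urls)
```

If your feed has no namespace declaration, the plain //content expression is enough and the namespaces argument can be dropped.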

Upvotes: 3
