Reputation: 391
I am scraping an Atom feed (XML). One of the elements looks like this:
<content type="html">
&lt;p&gt; Some text and stuff &lt;/p&gt;
</content>
I also see the same HTML entities for img and a tags. Is there a generic XPath to find the img tag or the p tag, like this:
//content/p or //content/img/@src
But obviously this does not work with these HTML entities. Or is there maybe another solution with Scrapy?
Upvotes: 2
Views: 534
Reputation: 20748
I think you need to extract the text of each content element and then parse that text as HTML using lxml.html:
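import lxml.etree
import lxml.html

# xmlfeedstring holds the raw Atom feed
xmlfeed = lxml.etree.fromstring(xmlfeedstring)

# text() returns the content with the entities already decoded to real markup
for content in xmlfeed.xpath('//content[@type="html"]/text()'):
    # create_parent=True wraps the fragment in a <div>, so several top-level elements are accepted
    htmlcontent = lxml.html.fragment_fromstring(content, create_parent=True)
    paragraphs = htmlcontent.xpath('.//p')
    image_urls = htmlcontent.xpath('.//img/@src')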
See "Parsing HTML fragments" in the lxml documentation.
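For reference, here is a minimal self-contained sketch of the same two-step idea. The sample feed, the `extract_from_feed` helper and the `ATOM_NS` prefix mapping are made up for illustration. It also assumes the feed declares the standard Atom namespace, in which case a plain `//content` would not match and the XPath needs a namespace prefix; if your feed has no namespace, the un-prefixed XPath should be fine.

```python
import lxml.etree
import lxml.html

# Hypothetical sample feed (an assumption, not taken from the question).
# Passed as bytes because the XML declaration names an encoding.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="html">
      &lt;p&gt; Some text and stuff &lt;/p&gt;
      &lt;a href="http://example.com/"&gt;&lt;img src="http://example.com/a.png"&gt;&lt;/a&gt;
    </content>
  </entry>
</feed>"""

# Prefix mapping for the Atom namespace; the prefix name "atom" is arbitrary.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def extract_from_feed(feed_bytes):
    """Return (paragraph_texts, image_urls) found in entity-escaped HTML content."""
    feed = lxml.etree.fromstring(feed_bytes)
    paragraph_texts = []
    image_urls = []
    # Step 1: the XML parser decodes &lt; and &gt;, so text() yields real HTML markup.
    for escaped_html in feed.xpath('//atom:content[@type="html"]/text()',
                                   namespaces=ATOM_NS):
        # Step 2: parse that markup as an HTML fragment; create_parent=True
        # wraps it in a <div> so several top-level elements are allowed.
        fragment = lxml.html.fragment_fromstring(escaped_html, create_parent=True)
        # Relative XPaths (.//) keep the search inside this fragment.
        paragraph_texts.extend(
            p.text_content().strip() for p in fragment.xpath('.//p'))
        image_urls.extend(fragment.xpath('.//img/@src'))
    return paragraph_texts, image_urls


paragraphs, images = extract_from_feed(SAMPLE_FEED)
print(paragraphs)
print(images)
```

In a Scrapy spider the same approach should carry over: pull the text of the content element out of the XML response first, then wrap that string in `scrapy.Selector(text=...)` and run the `//p` and `//img/@src` XPaths on it.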
Upvotes: 3