beboy

Reputation: 103

Extract data from nested XPath

I am a newbie with XPath. I want to extract every title, body, link, and release date from this link.

Everything seems okay except the body. How do I extract every body from a nested XPath? Thanks in advance :)

Here is my source:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from thehack.items import ThehackItem
class MySpider(BaseSpider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        articles = hxs.xpath('//article[@class="post item module"]')
        items = []
        for article in articles:
            item = ThehackItem()
            item['title'] = article.select('span/h2/a/text()').extract()
            item['link'] = article.select('span/h2/a/@href').extract()
            # This is the selector that comes back empty
            item['body'] = article.select('span/div/div/div/div/a/div/text()').extract()
            item['date'] = article.select('span/div/span/text()').extract()
            items.append(item)
        return items

Can anybody fix the body selector? Only the body fails. Thanks in advance. Here is a picture of the inspected elements from the website: [image: element inspection of the page]

Upvotes: 0

Views: 533

Answers (1)

Wilfredo

Reputation: 1548

I think you were struggling with the selectors, right? You should check the Scrapy documentation for selectors; there's a lot of good information there. For this particular example, using CSS selectors, I think it would be something like:

import scrapy
from thehack.items import ThehackItem

class MySpider(scrapy.Spider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]

    def parse(self, response):
        # Each selector below is evaluated relative to one article
        for article in response.css('article.post'):
            item = ThehackItem()
            item['title'] = article.css('.post-title>a::text').extract_first()
            item['link'] = article.css('.post-title>a::attr(href)').extract_first()
            # Join the text of every descendant of the summary element
            item['body'] = ''.join(article.css('[id^=summary] *::text').extract()).strip()
            item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item

It would be a good exercise for you to convert them to XPath selectors, and maybe also look into ItemLoaders; together they are very useful.
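To illustrate why a fixed path like span/div/div/div/div/a/div/text() is fragile, here is a minimal sketch of the alternative idea: query relative to each article and collect the text of every descendant of the summary element. It deliberately uses the standard library's xml.etree.ElementTree instead of Scrapy so it runs on its own; the SAMPLE markup and the parse_articles helper are hypothetical, and the real page's structure may differ. In Scrapy, the rough XPath counterpart of the body selector would be article.xpath('.//*[starts-with(@id, "summary")]//text()'), assuming the ids really start with "summary" as the CSS selector above suggests.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real page.
SAMPLE = """
<html><body>
  <article class="post">
    <h2 class="post-title"><a href="http://example.com/a">First title</a></h2>
    <div id="summary1"><div><a><div>Body <b>with</b> nested text.</div></a></div></div>
    <meta itemprop="datePublished" content="2015-01-01"/>
  </article>
  <article class="post">
    <h2 class="post-title"><a href="http://example.com/b">Second title</a></h2>
    <div id="summary2">Plain summary.</div>
    <meta itemprop="datePublished" content="2015-01-02"/>
  </article>
</body></html>
"""


def parse_articles(markup):
    """Return one dict per <article>, using queries relative to that article."""
    root = ET.fromstring(markup.strip())
    items = []
    for article in root.iterfind(".//article"):
        link = article.find(".//h2/a")
        date = article.find(".//meta[@itemprop='datePublished']")
        # ElementTree's XPath subset has no starts-with(), so filter ids in Python.
        summary = next(
            (el for el in article.iter("div") if el.get("id", "").startswith("summary")),
            None,
        )
        items.append({
            "title": link.text if link is not None else None,
            "link": link.get("href") if link is not None else None,
            # itertext() walks every nested descendant, whatever the depth.
            "body": "".join(summary.itertext()).strip() if summary is not None else "",
            "date": date.get("content") if date is not None else None,
        })
    return items


if __name__ == "__main__":
    for item in parse_articles(SAMPLE):
        print(item)
```

The body is gathered from all descendants at once, so the result does not depend on how many div levels sit between the summary element and its text.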

Upvotes: 1
