Reputation: 293
I am scraping this site with Scrapy, but I am having trouble with the XPath and I'm not entirely sure what is going on.
Why does this work:
def parse_item(self, response):
    item = BotItem()
    for title in response.xpath('//h1'):
        item['title'] = title.xpath('strong/text()').extract()
        item['wage'] = title.xpath('span[@class="price"]/text()').extract()
        yield item
while the following code does not?
def parse_item(self, response):
    item = BotItem()
    for title in response.xpath('//body'):
        item['title'] = title.xpath('h1/strong/text()').extract()
        item['wage'] = title.xpath('h1/span[@class="price"]/text()').extract()
        yield item
I also want to extract the content matched by this XPath:
//div[@id="description"]/p
But I can't, because it is outside the h1 node. How can I achieve this? My full code is:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bot.items import BotItem

class MufmufSpider(CrawlSpider):
    name = 'mufmuf'
    allowed_domains = ['mufmuf.ro']
    start_urls = ['http://mufmuf.ro/locuri-de-munca/joburi-in-strainatate/']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="paginate"][position() = last()]'),
            #callback='parse_start_url',
            follow=True
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//h3/a'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        item = BotItem()
        for title in response.xpath('//h1'):
            item['title'] = title.xpath('strong/text()').extract()
            item['wage'] = title.xpath('span[@class="price"]/text()').extract()
            #item['description'] = title.xpath('div[@id="descirption"]/p/text()').extract()
            yield item
Upvotes: 0
Views: 818
Reputation: 474171
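The for title in response.xpath('//body'): variant does not work because the XPath expressions inside the loop are relative to body: h1/strong/text() only matches an h1 element that is a direct child of the body element, so it finds nothing when the h1 is nested deeper in the page.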
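To illustrate the difference between a child step (h1/...) and a descendant step (.//h1/...), here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy selectors so it runs without a crawl, and the markup is a hypothetical stand-in, since the real page structure is not shown in the question. ElementTree only supports a subset of XPath (no text() step), so the text is read from the .text attribute instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (an assumption, not the real page):
# the h1 is wrapped in a div, so it is NOT a direct child of body.
PAGE = """
<html>
  <body>
    <div id="content">
      <h1><strong>Job title</strong><span class="price">1000 EUR</span></h1>
      <div id="description"><p>First paragraph</p><p>Second paragraph</p></div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)
body = root.find("body")

# Child step: only an h1 directly under body would match, so this is empty.
direct = [e.text for e in body.findall("h1/strong")]

# Descendant step: the h1 may sit at any depth below body.
nested = [e.text for e in body.findall(".//h1/strong")]
wage = [e.text for e in body.findall(".//h1/span[@class='price']")]

# The description lives outside the h1, so it is selected from body, not from the h1.
description = [e.text for e in body.findall(".//div[@id='description']/p")]

print(direct, nested, wage, description)
```

With Scrapy selectors the same idea would be written as title.xpath('.//h1/strong/text()') when title is the body selector; the leading .// keeps the search relative to the current node while allowing any nesting depth.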
Moreover, since there is only one desired entity to extract you don't need a loop here at all:
def parse_item(self, response):
    item = BotItem()
    item["title"] = response.xpath('//h1/strong/text()').extract()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract()
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()
    return item
(This should also answer your second question, about the description.)
Upvotes: 4