Reputation: 75
I spend lot of time trying to scrape information with scrapy without sucess. My goal is to surf through category and for each item scrape title,price and title's href link.
The problem seems to come from the parse_items function. I've check xpath with firepath and I'm able to select the items as wanted, so maybe I just don't catch how xpath are processed by scrapy...
Here is my code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from ..items import electronic_Item
class robot_makerSpider(CrawlSpider):
name = "robot_makerSpider"
allowed_domains = ["robot-maker.com"]
start_urls = [
"http://www.robot-maker.com/shop/",
]
rules = (
Rule(LinkExtractor(
allow=(
"http://www.robot-maker.com/shop/12-kits-robots",
"http://www.robot-maker.com/shop/36-kits-debutants-arduino",
"http://www.robot-maker.com/shop/13-cartes-programmables",
"http://www.robot-maker.com/shop/14-shields",
"http://www.robot-maker.com/shop/15-capteurs",
"http://www.robot-maker.com/shop/16-moteurs-et-actionneurs",
"http://www.robot-maker.com/shop/17-drivers-d-actionneurs",
"http://www.robot-maker.com/shop/18-composants",
"http://www.robot-maker.com/shop/20-alimentation",
"http://www.robot-maker.com/shop/21-impression-3d",
"http://www.robot-maker.com/shop/27-outillage",
),
),
callback='parse_items',
),
)
def parse_items(self, response):
hxs = Selector(response)
products = hxs.xpath("//div[@id='center_column']/ul/li")
items = []
for product in products:
item = electronic_Item()
item['title'] = product.xpath(
"li[1]/div/div/div[2]/h2/a/text()").extract()
item['price'] = product.xpath(
"div/div/div[3]/div/div[1]/span[1]/text()").extract()
item['url'] = product.xpath(
"li[1]/div/div/div[2]/h2/a/@href").extract()
#check that all field exist
if item['title'] and item['price'] and item['url']:
items.append(item)
return items
thanks for your help
Upvotes: 0
Views: 254
Reputation: 21406
The xpaths in your spider are indeed faulty.
Your first xpath for products does work but it's not explicit enough and might fail really easily. While the product detail xpaths are not working at all.
I've got it working with:
products = response.xpath("//div[@class='product-container']")
items = []
for product in products:
item = dict()
item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
item['url'] = product.xpath('.//h2/a/@href').extract_first()
item['price'] = product.xpath(".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
All modern websites have very parsing friendly html sources (since they need to parse it themselves for their fancy css styles and javascript functions).
So generally you should look at class and id names of nodes you want to extract with browser inspect tools (right click -> inspect element) instead of using some automated selection tool. it's more reliable and doesn't take much more work once you get the hang of it.
Upvotes: 0