Scrapy unable to scrape items, XPath not working

I've spent a lot of time trying to scrape information with Scrapy, without success. My goal is to crawl through the categories and, for each item, scrape the title, the price, and the href link of the title.

The problem seems to come from the parse_items function. I've checked the XPaths with FirePath and I'm able to select the items as intended, so maybe I just don't understand how XPaths are processed by Scrapy...

Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from ..items import electronic_Item


class robot_makerSpider(CrawlSpider):
    name = "robot_makerSpider"
    allowed_domains = ["robot-maker.com"]
    start_urls = [
        "http://www.robot-maker.com/shop/",
    ]

    rules = (
        Rule(LinkExtractor(
            allow=(
                "http://www.robot-maker.com/shop/12-kits-robots",
                "http://www.robot-maker.com/shop/36-kits-debutants-arduino",
                "http://www.robot-maker.com/shop/13-cartes-programmables",
                "http://www.robot-maker.com/shop/14-shields",
                "http://www.robot-maker.com/shop/15-capteurs",
                "http://www.robot-maker.com/shop/16-moteurs-et-actionneurs",
                "http://www.robot-maker.com/shop/17-drivers-d-actionneurs",
                "http://www.robot-maker.com/shop/18-composants",
                "http://www.robot-maker.com/shop/20-alimentation",
                "http://www.robot-maker.com/shop/21-impression-3d",
                "http://www.robot-maker.com/shop/27-outillage",
                ),
            ),
            callback='parse_items',
        ),
    )


    def parse_items(self, response):
        hxs = Selector(response)
        products = hxs.xpath("//div[@id='center_column']/ul/li")
        items = []

        for product in products:
            item = electronic_Item()
            item['title'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/text()").extract()
            item['price'] = product.xpath(
                "div/div/div[3]/div/div[1]/span[1]/text()").extract()
            item['url'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/@href").extract()
            
            #check that all field exist
            if item['title'] and item['price'] and item['url']:
                items.append(item)
        return items
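
For reference, electronic_Item in items.py is a standard scrapy.Item; only the three fields the spider fills in are shown here:

# items.py -- standard scrapy.Item with the fields used by the spider
import scrapy

class electronic_Item(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()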

Thanks for your help.

Upvotes: 0

Views: 254

Answers (1)

Granitosaurus

Reputation: 21406

The XPaths in your spider are indeed faulty.

Your first XPath for products does work, but it isn't explicit enough and could break easily, while the product-detail XPaths don't match anything at all: each product in your loop is already an li node, so a relative path starting with li[1] searches for a nested li that (most likely) isn't there.

I've got it working with:

products = response.xpath("//div[@class='product-container']")
items = []

for product in products:
    item = dict()
    item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
    item['url'] = product.xpath('.//h2/a/@href').extract_first()
    item['price'] = product.xpath(".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
    # collect the parsed product so the list is actually returned/used
    items.append(item)
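
You can sanity-check these selectors in scrapy shell before touching the spider; for example, against one of the category URLs from your rules (output elided):

$ scrapy shell "http://www.robot-maker.com/shop/12-kits-robots"
>>> # expect one selector per product card
>>> response.xpath("//div[@class='product-container']")
>>> # and the titles should come out as a clean list of strings
>>> response.xpath("//div[@class='product-container']//h2/a/text()").extract()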

Most modern websites have fairly parsing-friendly HTML sources, since they need to target the same nodes themselves for their CSS styles and JavaScript functions.

So in general you should look at the class and id attributes of the nodes you want to extract, using your browser's inspect tools (right click -> Inspect Element), instead of relying on an automated selection tool. It's more reliable and doesn't take much more work once you get the hang of it.
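
To make the difference concrete, compare the positional price XPath from your spider with the attribute-based one above; the first breaks as soon as the div nesting shifts, the second only depends on a class name the page needs for its own styling:

# Brittle: tied to the exact nesting and order of anonymous divs.
product.xpath("div/div/div[3]/div/div[1]/span[1]/text()")

# Robust: anchored on a class the site itself relies on.
product.xpath(".//span[contains(@class,'product-price')]/text()")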

Upvotes: 0
