Ycon

Reputation: 1950

Conditional URL scraping with Scrapy

I am trying to use Scrapy on a site whose URL structure I don't know in advance.

I would like to crawl every page on the site and, on pages that contain a product (a `product-view` div), extract the URL, product name, and price.

When I run the script below, all I get is a random list of URLs:

scrapy crawl dmoz > test.txt

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'site.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = "dmoz"
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        for url in response.xpath('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url= URL + url
            if response.xpath('//div[@class="product-view"]'):
                url = response.extract()
                name = response.xpath('//div[@class="product-name"]/h1/text()').extract()
                price = response.xpath('//span[@class="product_price_details"]/text()').extract()
            yield Request(url, callback=self.parse)
            print url

Upvotes: 2

Views: 1543

Answers (1)

Granitosaurus

Reputation: 21436

What you are looking for here is scrapy.spiders.CrawlSpider.

However, you almost got it with your own approach; here's a fixed version:

from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # parse this page
    if response.xpath('//div[@class="product-view"]'):
        item = dict()
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="product-name"]/h1/text()').extract_first()
        item['price'] = response.xpath('//span[@class="product_price_details"]/text()').extract_first()
        yield item  # return an item with your data
    # crawl other pages
    le = LinkExtractor()  # LinkExtractor is smarter than xpath '//a/@href'
    for link in le.extract_links(response):
        yield Request(link.url)  # the default callback is already self.parse
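As an aside, the manual `URL + url` concatenation in the question breaks on relative links and double-joins absolute ones. The standard library's `urljoin` handles all the cases correctly; a small sketch (the base URL and hrefs here are made up for illustration):

```python
# Join scraped hrefs against the page URL with urljoin instead of
# naive string concatenation (URL + url).
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = 'http://site.com/category/page1'
hrefs = ['/product/1', 'product/2', 'http://other.com/x']
absolute = [urljoin(base, h) for h in hrefs]
# '/product/1' -> 'http://site.com/product/1'       (root-relative)
# 'product/2'  -> 'http://site.com/category/product/2' (page-relative)
# absolute URLs pass through unchanged
```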

Now you can simply run scrapy crawl dmoz -o results.csv and Scrapy will output a CSV of your items. Keep an eye on the log, especially the stats at the end; that's how you know if something went wrong.

Upvotes: 2
