muhasturk

Reputation: 2634

Scrapy CrawlSpider doesn't obey deny rules

I searched for similar issues on Stack Overflow and other Q&A sites, but could not find a proper answer to my problem.

I have written the following spider to crawl nautilusconcept.com. The site's category structure is quite messy, so I had to apply a rule that follows every link and passes it to a callback; I then decide which URLs actually get parsed with an if statement inside the parse_item method. The problem is that the spider doesn't obey my deny rules and still tries to crawl links containing (catinfo.asp?brw...).

Here is my spider:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from vitrinbot.items import ProductItem
from vitrinbot.base import utils
import hashlib

removeCurrency = utils.removeCurrency
getCurrency = utils.getCurrency

class NautilusSpider(CrawlSpider):
    name = 'nautilus'
    allowed_domains = ['nautilusconcept.com']
    start_urls = ['http://www.nautilusconcept.com/']
    xml_filename = 'nautilus-%d.xml'
    xpaths = {
        'category' :'//tr[@class="KategoriYazdirTabloTr"]//a/text()',
        'title':'//h1[@class="UrunBilgisiUrunAdi"]/text()',
        'price':'//hemenalfiyat/text()',
        'images':'//td[@class="UrunBilgisiUrunResimSlaytTd"]//div/a/@href',
        'description':'//td[@class="UrunBilgisiUrunBilgiIcerikTd"]//*/text()',
        'currency':'//*[@id="UrunBilgisiUrunFiyatiDiv"]/text()',
        'check_page':'//div[@class="ayrinti"]'
    }

    rules = (
        Rule(
            LinkExtractor(
                allow=('com/[\w_]+',),
                deny=(
                    'asp$',
                    'login\.asp',
                    'hakkimizda\.asp',
                    'musteri_hizmetleri\.asp',
                    'iletisim_formu\.asp',
                    'yardim\.asp',
                    'sepet\.asp',
                    'catinfo\.asp\?brw',
                ),
            ),
            callback='parse_item',
            follow=True
        ),
    )


    def parse_item(self, response):
        i = ProductItem()
        sl = Selector(response=response)

        # Skip pages that are not product detail pages
        if not sl.xpath(self.xpaths['check_page']):
            return i

        i['id'] = hashlib.md5(response.url.encode('utf-8')).hexdigest()
        i['url'] = response.url
        i['category'] = " > ".join(sl.xpath(self.xpaths['category']).extract()[1:-1])
        i['title'] = sl.xpath(self.xpaths['title']).extract()[0].strip()
        i['special_price'] = i['price'] = sl.xpath(self.xpaths['price']).extract()[0].strip().replace(',', '.')

        # Image links are relative, so prepend the site root
        images = []
        for img in sl.xpath(self.xpaths['images']).extract():
            images.append("http://www.nautilusconcept.com/" + img)
        i['images'] = images

        i['description'] = (" ".join(sl.xpath(self.xpaths['description']).extract())).strip()

        i['brand'] = "Nautilus"

        i['expire_timestamp'] = i['sizes'] = i['colors'] = ''

        i['currency'] = sl.xpath(self.xpaths['currency']).extract()[0].strip()

        return i

Here is a piece of the Scrapy log:

2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=-1&order=&src=&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=name&src=&stock=1&typ=7)

The spider also crawls the proper pages, but it must not try to crawl links that contain (catinfo.asp?brw...).

I'm using Scrapy==0.24.2 and Python 2.7.6.

Upvotes: 2

Views: 1660

Answers (1)

paul trmbrth

Reputation: 20748

It's a canonicalization "issue". By default, LinkExtractor returns canonicalized URLs, but the regexes from deny and allow are applied before canonicalization. In the page markup the query parameters most likely appear in a different order (e.g. catinfo.asp?cid=64&brw=0), so catinfo\.asp\?brw never matches the raw link; the link passes the filter and is only canonicalized afterwards, which is why brw shows up first in your log.
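
You can see the reordering with w3lib's canonicalize_url, the helper Scrapy uses under the hood. This is a minimal sketch; the input URL is a made-up example of a pre-canonicalization link, not one taken from the page:

# canonicalize_url sorts query parameters alphabetically,
# so 'brw' ends up first even if it wasn't first in the markup.
from w3lib.url import canonicalize_url

raw = "http://www.nautilusconcept.com/catinfo.asp?cid=64&brw=0&order=prc"
print(canonicalize_url(raw))
# http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&order=prc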

I suggest you use these rules:

rules = (
    Rule(
        LinkExtractor(
            allow=('com/[\w_]+',),
            deny=(
                'asp$',
                'login\.asp',
                'hakkimizda\.asp',
                'musteri_hizmetleri\.asp',
                'iletisim_formu\.asp',
                'yardim\.asp',
                'sepet\.asp',
                'catinfo\.asp\?.*brw',
            ),
        ),
        callback='parse_item',
        follow=True
    ),
)
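
A quick way to convince yourself of the difference between the two deny patterns (again with a made-up URL whose parameters are in page-source order):

import re

raw = "http://www.nautilusconcept.com/catinfo.asp?cid=64&brw=0&order=prc"

print(bool(re.search(r'catinfo\.asp\?brw', raw)))    # False: old pattern needs brw right after '?'
print(bool(re.search(r'catinfo\.asp\?.*brw', raw)))  # True: new pattern matches brw anywhere in the query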

Upvotes: 2
