Reputation: 423
For three days I have been trying to save the respective start_url in a meta attribute so I can pass it along as an item to subsequent requests in Scrapy; that way I can use the start_url as a key into a dictionary to populate my output with additional data. It should actually be straightforward, because it is explained in the documentation ...
There is a discussion in the Google Scrapy group, and there was a similar question here as well, but I can't get it to run :(
I am new to Scrapy and I think it is a great framework, but for my project I need to know the start_urls of all requests, and that turns out to be quite complicated.
I would really appreciate some help!
At the moment my code looks like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class example(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/',)), callback='parse_item'),
    )

    def parse(self, response):
        for request_or_item in super(example, self).parse(response):
            if isinstance(request_or_item, Request):
                # carry the start_url over to the follow-up request
                request_or_item = request_or_item.replace(meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        # seed each start request with its own URL in meta
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = testItem()
        print response.request.meta, response.url
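For reference, the meta-passing pattern the documentation describes boils down to something like this (a minimal sketch in the same old Scrapy API; parse_page1/parse_page2 and the URL are just illustrative names, not from my spider):

from scrapy.http import Request

def parse_page1(self, response):
    # stash the value on the request; meta travels with it to the callback
    request = Request('http://www.example.com/some_page.html',
                      callback=self.parse_page2)
    request.meta['start_url'] = response.url
    return request

def parse_page2(self, response):
    # the value set in parse_page1 is available again here
    print response.meta['start_url']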
Upvotes: 2
Views: 2096
Reputation: 59674
I wanted to delete this answer because it doesn't solve the OP's problem, but I decided to leave it as a Scrapy example.
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
Use BaseSpider instead:
from datetime import datetime
import urlparse

from scrapy.spider import BaseSpider

# `settings` (holding a db connection) and `items` are project modules
# here, not part of Scrapy itself
import settings
import items

class Spider(BaseSpider):
    name = "domain_spider"

    def start_requests(self):
        last_domain_id = 0
        chunk_size = 10
        cursor = settings.db.cursor()

        while True:
            # fetch the next chunk of domains that haven't been started yet
            cursor.execute("""
                    SELECT domain_id, domain_url
                    FROM domains
                    WHERE domain_id > %s AND scraping_started IS NULL
                    LIMIT %s
                """, (last_domain_id, chunk_size))
            self.log('Requesting %s domains after %s' % (chunk_size, last_domain_id))
            rows = cursor.fetchall()
            if not rows:
                self.log('No more domains to scrape.')
                break

            for domain_id, domain_url in rows:
                last_domain_id = domain_id
                request = self.make_requests_from_url(domain_url)

                # attach an item carrying the start_url (and friends) to the request
                item = items.Item()
                item['start_url'] = domain_url
                item['domain_id'] = domain_id
                item['domain'] = urlparse.urlparse(domain_url).hostname
                request.meta['item'] = item

                # mark the domain as started so it isn't picked up again
                cursor.execute("""
                        UPDATE domains
                        SET scraping_started = %s
                        WHERE domain_id = %s
                    """, (datetime.now(), domain_id))

                yield request

    ...
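As a side note: if you'd rather keep the CrawlSpider, newer Scrapy versions (2.0+) pass the source response into a Rule's process_request hook, so you can copy meta across without overriding parse. A sketch against that modern API (the class and hook names are illustrative; this is not available in the 0.x API used above):

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=('/blablabla/',)),
             callback='parse_item',
             process_request='carry_start_url'),
    )

    def start_requests(self):
        # seed every start URL with its own marker in meta
        for url in self.start_urls:
            yield Request(url, dont_filter=True, meta={'start_url': url})

    def carry_start_url(self, request, response):
        # copy the marker from the page the link was extracted from
        request.meta['start_url'] = response.meta.get('start_url')
        return request

    def parse_item(self, response):
        self.log('scraped %s (start_url: %s)' % (
            response.url, response.meta['start_url']))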
Upvotes: 2