Dima

Reputation: 73

Scrapy crawling the same page over and over again for different URLs on a German site

I am trying to extract info on flats/rooms from a German site called WG-Gesucht. I figured out that their links follow this pattern:

http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.X.html

where X = 0, 1, 2, ...

When I paste these links into my browser, they work perfectly. However, my optimism was shattered when I tried crawling them: no matter which page I request, I only end up with the entries for X = 0 in my database.

Here is my spider:

from scrapy.http.request import Request
from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose

from scraper_app.items import WGGesuchtEntry


class WGGesuchtSpider(Spider):
    """Spider for wg-gesucht.de, Berlin"""
    name = "wggesucht"
    allowed_domains = ["wg-gesucht.de"]
    start_urls = ["http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.0.html"]
    # start_urls = ["http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.%s.html"%x for x in range(0,1)]


    # each listing row in the results table has an id of the form "ad--..."
    entries_list_xpath = '//tr[contains(@id,"ad--")]'
    item_fields = {
        # 'title': './/span[@itemscope]/meta[@itemprop="name"]/@content',
        'rooms': './/td[2]/a/span/text()',
        'entry_date': './/td[3]/a/span/text()',
        'price': './/td[4]/a/span/b/text()',
        'size': './/td[5]/a/span/text()',
        'district': './/td[6]/a/span/text()',
        'start_date': './/td[7]/a/span/text()',
        'end_date': './/td[8]/a/span/text()',
        'link': './/@adid'
    }

    def start_requests(self):
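        # note: overriding start_requests means start_urls above is never
        # used; requests for pages 1-9 are all scheduled up front at once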
        for i in xrange(1, 10):
            url = 'http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.' + str(i) + '.html'
            yield Request(url=url, callback=self.parse_items)


    def parse_items(self, response):
        """
        Default callback used by Scrapy to process downloaded responses

        # Testing contracts:
        # @url http://www.livingsocial.com/cities/15-san-francisco
        # @returns items 1
        # @scrapes title link

        """
        selector = HtmlXPathSelector(response)

        # iterate over deals
        for entry in selector.xpath(self.entries_list_xpath):
            loader = XPathItemLoader(WGGesuchtEntry(), selector=entry)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()

Should I maybe use the CrawlSpider instead of Spider?

Any suggestions are most welcome, thank you!

Upvotes: 4

Views: 2016

Answers (1)

eLRuLL

Reputation: 18799

It looks like a cookie problem. You can check this by opening a new browser session and going directly to, say, the 6th page: you will receive the response of the first page.
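If cookies really are the culprit, one quick way to test the diagnosis is to switch off Scrapy's cookie handling entirely via the standard COOKIES_ENABLED setting (a minimal sketch; whether this alone fixes the paging is an assumption to verify, not a confirmed fix):

# settings.py
# disable Scrapy's cookie middleware so every request is sent
# without session cookies
COOKIES_ENABLED = False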

Scrapy reuses cookies across subsequent requests, so one way of solving this is not to schedule all the page requests at once, but to make them one after the other, like:

import re

start_urls = ['http://example.com/0.html']

def parse(self, response):
    cur_index = response.meta.get('cur_index', 1)
    ...
    # use the response.url to build the following url (+1 to the index)
    new_url = re.sub(r'\.\d+\.html$', '.%d.html' % cur_index, response.url)
    if cur_index < 10:
        yield Request(new_url, callback=self.parse, meta={'cur_index': cur_index + 1})
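Applied to the spider from the question, the chained approach could look like the sketch below. This is a minimal, untested sketch: it reuses the question's XPaths, WGGesuchtEntry, and the same old Scrapy/Python 2 APIs; start_requests is dropped so that start_urls supplies the X = 0 page, and the default parse callback requests the next page only after the current one has been parsed:

import re

from scrapy.http.request import Request
from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose

from scraper_app.items import WGGesuchtEntry


class WGGesuchtSpider(Spider):
    """Spider for wg-gesucht.de, Berlin (chained-paging variant)"""
    name = "wggesucht"
    allowed_domains = ["wg-gesucht.de"]
    # only the X = 0 page; the remaining pages are reached by chaining
    start_urls = ["http://www.wg-gesucht.de/wohnungen-in-Berlin.8.2.0.0.html"]

    entries_list_xpath = '//tr[contains(@id,"ad--")]'
    item_fields = {
        'rooms': './/td[2]/a/span/text()',
        'entry_date': './/td[3]/a/span/text()',
        'price': './/td[4]/a/span/b/text()',
        'size': './/td[5]/a/span/text()',
        'district': './/td[6]/a/span/text()',
        'start_date': './/td[7]/a/span/text()',
        'end_date': './/td[8]/a/span/text()',
        'link': './/@adid'
    }

    def parse(self, response):
        # with no start_requests override, start_urls is fetched and this
        # default callback handles every page
        cur_index = response.meta.get('cur_index', 1)

        selector = HtmlXPathSelector(response)
        for entry in selector.xpath(self.entries_list_xpath):
            loader = XPathItemLoader(WGGesuchtEntry(), selector=entry)
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()

        # request the next page only after this one was parsed, so the
        # session cookies always match the page being requested
        if cur_index < 10:
            new_url = re.sub(r'\.\d+\.html$', '.%d.html' % cur_index,
                             response.url)
            yield Request(new_url, callback=self.parse,
                          meta={'cur_index': cur_index + 1})

The trade-off is that the pages are now fetched strictly one after another instead of in parallel, which is exactly what keeps the session cookies consistent with the page being requested.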

Upvotes: 4
