Wessi

Scrapy DEBUG: Crawled (200)

I'm trying to scrape a webpage using Scrapy and XPath selectors. I've tested my XPath selectors in Chrome, but the spider finds no data — every item field comes back empty. What can I do to correct this? I get the following output when crawling:

$ scrapy crawl stack
2015-08-24 21:11:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: stack)
2015-08-24 21:11:55 [scrapy] INFO: Optional features available: ssl, http11
2015-08-24 21:11:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'BOT_NAME': 'stack'}
2015-08-24 21:11:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-24 21:11:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 21:11:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 21:11:56 [scrapy] INFO: Enabled item pipelines:
2015-08-24 21:11:56 [scrapy] INFO: Spider opened
2015-08-24 21:11:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-24 21:11:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-24 21:11:56 [scrapy] DEBUG: Crawled (200) <GET http://www.cofman.com/search.php?country=DK#areaid=100001&areatxt=Danmark&country=DK&zoom=6&startDate=2015-08-29&endDate=2015-09-05&fuzzy=false> (referer: None)
2015-08-24 21:11:56 [scrapy] DEBUG: Scraped from <200 http://www.cofman.com/search.php?country=DK>
{'by': [], 'husnr': [], 'periode': [], 'pris': []}
2015-08-24 21:11:56 [scrapy] INFO: Closing spider (finished)
2015-08-24 21:11:56 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 233,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6059,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 8, 24, 19, 11, 56, 875000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 8, 24, 19, 11, 56, 390000)}
2015-08-24 21:11:56 [scrapy] INFO: Spider closed (finished)

This is my spider:

from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["cofman.com"]
    start_urls = [
        "http://www.cofman.com/search.php?country=DK#areaid=100001&areatxt=Danmark&country=DK&zoom=6&startDate=2015-08-29&endDate=2015-09-05&fuzzy=false",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//*[@id="content"]/div[4]')

        for question in questions:
            item = StackItem()
            item['husnr'] = question.xpath(
                '//*[@id="resultListning"]/div/div/div[1]/a/small').extract()
            item['pris'] = question.xpath(
                '//*[@id="resultListning"]/div/div/div[5]/div/div[1]//*/span[@class="formatted_price"]').extract()
            item['by'] = question.xpath(
                '//*[@id="resultListning"]/div/div/div[1]/a/text()').extract()
            item['periode'] = question.xpath(
                '//*[@id="mapNavigation"]/table/tbody/tr/td[1]/div/text()').extract()
            yield item

And lastly my items.py:

from scrapy.item import Item, Field


class StackItem(Item):
    husnr = Field()
    pris = Field()
    by = Field()
    periode = Field()

Upvotes: 3

Views: 9686

Answers (1)

Rejected

Scrapy is working fine. However, the page you are trying to scrape fetches its content via JavaScript, so Scrapy never receives the content you want to scrape:

>>> Selector(response).xpath('//div[@id="resultListning"]').extract()
[u'<div id="resultListning"></div>']

You'll need to either find out where the page fetches its data from (typically an XHR request you can spot in the browser's network tab) and request that source directly, or use one of the various methods of rendering JavaScript, such as a headless browser.
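You can reproduce this observation without Scrapy at all. The snippet below is a minimal sketch using only the Python standard library; the `raw_html` string is a hypothetical stand-in for the markup Scrapy actually receives, matching the empty `<div id="resultListning"></div>` shown above. Parsing it yields no text inside the container, which is exactly why every XPath under that div extracts nothing:

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects all text nodes inside <div id="resultListning">."""

    def __init__(self):
        super().__init__()
        self.in_target = False   # are we currently inside the target div?
        self.depth = 0           # nesting depth within the target div
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.in_target:
            self.depth += 1
        elif tag == "div" and ("id", "resultListning") in attrs:
            self.in_target = True
            self.depth = 0

    def handle_endtag(self, tag):
        if self.in_target:
            if self.depth == 0:
                self.in_target = False
            else:
                self.depth -= 1

    def handle_data(self, data):
        if self.in_target:
            self.text.append(data)


# Hypothetical stand-in for the HTML Scrapy downloads: the container
# exists, but JavaScript only fills it after the page loads in a browser.
raw_html = '<html><body><div id="resultListning"></div></body></html>'

parser = DivTextCollector()
parser.feed(raw_html)
print(repr("".join(parser.text)))  # prints '' — nothing for XPath to extract
```

The same check against the real data source (once you locate the XHR endpoint) or against a JS-rendered response would show the listing text that the browser's DOM contains but the raw download does not.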

Upvotes: 3
