DucCuong

Reputation: 648

Scrapy doesn't extract data even though the XPath is correct

I am crawling data from http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2 (only this page, to test my crawler).

items.py

import scrapy

class ShipItem(scrapy.Item):
    name        = scrapy.Field()
    imo         = scrapy.Field()
    category    = scrapy.Field()
    image_urls  = scrapy.Field()
    images      = scrapy.Field()

class CategoryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()

settings.py

BOT_NAME = 'ship'
SPIDER_MODULES = ['ship.spiders']
NEWSPIDER_MODULE = 'ship.spiders'
DOWNLOAD_DELAY = 0.5

spider/shipspider.py

import scrapy
from ship.items import ShipItem

class ShipSpider(scrapy.Spider):

    name = "shipspider"
    allowed_domains = ["shipspotting.com"]
    page_url = "http://www.shipspotting.com"
    start_urls = [
        page_url + "/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2"
    ]

    def parse(self, response):
        ships = response.xpath('/html/body/center/table/tbody/tr/td[1]/table[1]/tbody/tr/td[2]/div[3]/center/table/tbody/tr/td/table[4]/tbody/tr')

        for ship in ships:
            item = ShipItem()
            item['name'] = ship.xpath('td/center/table[1]/tbody/tr/td[2]/span').extract()[0]

            yield item

spiders/categoryspider.py

import scrapy
from ship.items import CategoryItem

class CategorySpider(scrapy.Spider):
    name = "catspider"
    allowed_domains = ["shipspotting.com"]
    page_url = "http://www.shipspotting.com"
    start_urls = [
        page_url + "/gallery/categories.php"
    ]

    def parse(self, response):
        cats = response.xpath('//td[@class="whiteboxstroke"]/a')
        file = open('categories.txt', 'a')

        for cat in cats:
            item = CategoryItem()

            item['name'] = cat.xpath('img/@title').extract()[0]
            item['link'] = cat.xpath('@href').extract()[0]

            yield item

        file.close()

The catspider runs perfectly. However, the shipspider doesn't work; it just shows this output:

2015-06-24 20:15:16+0800 [scrapy] INFO: Scrapy 0.24.6 started (bot: ship)
2015-06-24 20:15:16+0800 [scrapy] INFO: Optional features available: ssl, http11
2015-06-24 20:15:16+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ship.spiders', 'SPIDER_MODULES': ['ship.spiders'], 'DOWNLOAD_DELAY': 0.5, 'BOT_NAME': 'ship'}
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-24 20:15:16+0800 [scrapy] INFO: Enabled item pipelines: 
2015-06-24 20:15:16+0800 [shipspider] INFO: Spider opened
2015-06-24 20:15:16+0800 [shipspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-24 20:15:16+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-24 20:15:16+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-24 20:15:19+0800 [shipspider] DEBUG: Crawled (200) <GET http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2> (referer: None)
2015-06-24 20:15:19+0800 [shipspider] INFO: Closing spider (finished)
2015-06-24 20:15:19+0800 [shipspider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 318,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 477508,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 6, 24, 12, 15, 19, 620358),
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 6, 24, 12, 15, 16, 319378)}
2015-06-24 20:15:19+0800 [shipspider] INFO: Spider closed (finished)

I was wondering whether my XPath was incorrect, but when I tried to select those elements in Chrome, everything worked correctly.


So, does my shipspider have some subtle problem?

Upvotes: 0

Views: 226

Answers (1)

Pawel Miech

Reputation: 7822

Browsers add tbody elements to tables when rendering, which is why your XPath works in the dev tools but fails with Scrapy. This is a common gotcha.
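
The difference is easy to demonstrate with lxml, the parser Scrapy uses under the hood. Unlike a browser, lxml does not insert tbody into the parsed tree (minimal sketch with made-up markup):

```python
from lxml import html

# A table fragment as it would arrive over the wire: no <tbody> in the source.
doc = html.fromstring("<table><tr><td class='whiteboxstroke'>a ship</td></tr></table>")

# The browser-generated path assumes a <tbody> that lxml never added.
with_tbody = doc.xpath("//table/tbody/tr")       # matches nothing

# Anchoring on a stable attribute works regardless of tbody.
without_tbody = doc.xpath("//tr[td[@class='whiteboxstroke']]")

print(len(with_tbody), len(without_tbody))
```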

Usually you need to work out the XPath yourself; don't trust automatically generated XPaths, as they are usually needlessly long. For example, to get the data about the ships you could just use an XPath like this:

//tr[td[@class='whiteboxstroke']]
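
As a rough sketch of how that row XPath could feed the spider, here is the same selection run with lxml against simplified, hypothetical markup (the real gallery page is more complex, and the inner span path is an assumption to adapt):

```python
from lxml import html

# Hypothetical markup loosely modeled on the gallery page, for illustration only.
page = html.fromstring("""
<table>
  <tr><td class="whiteboxstroke"><span>Ship One</span></td></tr>
  <tr><td class="whiteboxstroke"><span>Ship Two</span></td></tr>
  <tr><td class="footer">navigation links</td></tr>
</table>
""")

# Select only the rows that contain a whiteboxstroke cell...
rows = page.xpath("//tr[td[@class='whiteboxstroke']]")

# ...then extract each field relative to its row (note the leading "." in the XPath).
names = [row.xpath(".//span/text()")[0] for row in rows]
```

In the spider you would build one ShipItem per row the same way, yielding it inside the loop.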

To test your XPaths you should use the Scrapy shell, e.g.

> scrapy shell "http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2"
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fbf52c122d0>
[s]   item       {}
[s]   request    <GET http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2>
[s]   response   <200 http://www.shipspotting.com/gallery/search.php?limit=192&limitstart=2112&sortkey=p.lid&sortorder=desc&page_limit=192&viewtype=2>
[s]   settings   <scrapy.settings.Settings object at 0x7fbf54f5cf90>
[s]   spider     <DefaultSpider 'default' at 0x7fbf51f6a1d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: x = "/html/body/center/table/tbody/tr/td[1]/table[1]/tbody/tr/td[2]/div[3]/center/table/tbody/tr/td/table[4]/tbody/tr"

In [2]: response.xpath(x)
Out[2]: []

In [4]: response.xpath("//tr[td[@class='whiteboxstroke']]")
Out[4]: 
[<Selector xpath="//tr[td[@class='whiteboxstroke']]" data=u'<tr><td class="whiteboxstroke" style="pa'>,
 <Selector xpath="//tr[td[@class='whiteboxstroke']]" data=u'<tr><td class="whiteboxstroke" style="pa'>,
 <Selector xpath="//tr[td[@class='whiteboxstroke']]" data=u'<tr><td class="whiteboxstroke" style="pa'>,
 ...]

Upvotes: 4
