Trying to scrape data from a table without class or id using python scrapy

Question

I am trying to scrape data from http://rotoguru1.com/cgi-bin/hyday.pl?game=fd, but the code below returns no items. I suspect something is wrong with the table path or with the XPath for the name field. Could someone please help identify the problem?

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from kevin.items import RotoguruItem
from scrapy.selector import Selector, HtmlXPathSelector


class RotoguruSpider(CrawlSpider):

   name = 'rotoguru1'
   allowed_domains = ['rotoguru1.com']
   start_urls = ['http://rotoguru1.com/cgi-bin/hyday.pl?']
   rules = [Rule(LinkExtractor(allow=['mon=\d+&day=\d+&game=fd']), callback='parse_item')]


   def parse_item(self, response):
       hxs = HtmlXPathSelector(response)
       tablepath= '//table[@cellspacing="5"]/tbody//tr[position()>2]'
       rows = hxs.select(tablepath)
       items = []

       for row in rows:
          item = RotoguruItem()
          item['name'] = row.select("td[2]/a/text()").extract()
          items.append(item)
       return items

The output of the above code is:

H:\SourceDir\basketball>scrapy crawl rotoguru1 -o fd.csv
2015-03-19 18:52:40+0530 [scrapy] INFO: Scrapy 0.24.5 started (bot: basketball)
2015-03-19 18:52:40+0530 [scrapy] INFO: Optional features available: ssl, http11

2015-03-19 18:52:40+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE'
: 'basketball.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['basketball.spi
ders'], 'FEED_URI': 'fd.csv', 'BOT_NAME': 'basketball'}
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled extensions: FeedExporter, LogSta
ts, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuth
Middleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, Def
aultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, Redirec
tMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMid
dleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddlew
are
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled item pipelines:
2015-03-19 18:52:40+0530 [rotoguru1] INFO: Spider opened
2015-03-19 18:52:40+0530 [rotoguru1] INFO: Crawled 0 pages (at 0 pages/min), scr
aped 0 items (at 0 items/min)
2015-03-19 18:52:40+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6
023
2015-03-19 18:52:40+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080

2015-03-19 18:52:41+0530 [rotoguru1] DEBUG: Crawled (200)  (referer: None)
2015-03-19 18:52:41+0530 [rotoguru1] INFO: Closing spider (finished)
2015-03-19 18:52:41+0530 [rotoguru1] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 236,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 17052,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 3, 19, 13, 22, 41, 513000),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 3, 19, 13, 22, 40, 296000)}
2015-03-19 18:52:41+0530 [rotoguru1] INFO: Spider closed (finished)

Thanks in advance for helping.

Elias Dorneles · Accepted Answer

There are two issues:

1) The URL in start_urls does not include game=fd, but the LinkExtractor only follows links whose URLs contain game=fd.

2) The XPath for the table includes tbody, but that element is not present in the HTML source that Scrapy parses. The same expression works in the browser only because the browser adds tbody automatically.
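
The tbody point can be checked outside the browser. The snippet below is only a sketch: it uses the standard library's ElementTree rather than Scrapy's selectors, and RAW_HTML is a made-up, simplified stand-in for the page source (the player names are placeholders). It assumes, as described above, that the raw markup has no tbody element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source: the table has no <tbody>.
RAW_HTML = """
<html><body>
<table cellspacing="5">
  <tr><td>header row 1</td></tr>
  <tr><td>header row 2</td></tr>
  <tr><td>PG</td><td><a href="#">Player A</a></td></tr>
  <tr><td>SG</td><td><a href="#">Player B</a></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path that requires a tbody element, as in the question.
with_tbody = root.findall(".//table[@cellspacing='5']/tbody//tr")

# Same path with the tbody step removed.
without_tbody = root.findall(".//table[@cellspacing='5']//tr")

# ElementTree does not support position() > 2, so the two header rows
# are skipped by slicing instead.
names = [row.findtext("td[2]/a") for row in without_tbody[2:]]

print(len(with_tbody), len(without_tbody), names)
```

With this sample markup the tbody path matches no rows, while the path without tbody matches all four, which mirrors the empty result seen in the spider.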

Add game=fd to the URL in start_urls and change tablepath to //table[@cellspacing="5"]//tr[position()>2], and it should work.
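
To see why the missing game=fd matters, the allow pattern from the question can be tried on a few URLs with the re module. This is a rough illustration, assuming the pattern is applied as a regex search against each link's URL; the URLs in candidate_links are hypothetical examples, not links taken from the site.

```python
import re

# The allow pattern from the question's Rule, written as a raw string.
ALLOW = re.compile(r"mon=\d+&day=\d+&game=fd")

# Hypothetical link URLs, for illustration only.
candidate_links = [
    "http://rotoguru1.com/cgi-bin/hyday.pl?mon=3&day=18&game=fd",
    "http://rotoguru1.com/cgi-bin/hyday.pl?mon=3&day=18&game=dk",
    "http://rotoguru1.com/cgi-bin/hyday.pl?mon=3&day=18",
]

# Keep only the URLs the pattern would accept.
followed = [url for url in candidate_links if ALLOW.search(url)]

for url in followed:
    print(url)
```

Only the URL ending in game=fd passes, so if the start page links to none of those, the rule has nothing to follow and parse_item is never called.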
