Reputation: 23
I am trying to scrape data from http://rotoguru1.com/cgi-bin/hyday.pl?game=fd, but the code below returns no items. I think something is wrong with the table path or with the XPath for the name field. Could someone please help identify the problem?
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from kevin.items import RotoguruItem
from scrapy.selector import Selector, HtmlXPathSelector

class RotoguruSpider(CrawlSpider):
    name = 'rotoguru1'
    allowed_domains = ['rotoguru1.com']
    start_urls = ['http://rotoguru1.com/cgi-bin/hyday.pl?']
    rules = [Rule(LinkExtractor(allow=['mon=\d+&day=\d+&game=fd']), callback='parse_item')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        tablepath = '//table[@cellspacing="5"]/tbody//tr[position()>2]'
        rows = hxs.select(tablepath)
        items = []
        for row in rows:
            item = RotoguruItem()
            item['name'] = row.select("td[2]/a/text()").extract()
            items.append(item)
        return items

The output for the above code is:
H:\SourceDir\basketball>scrapy crawl rotoguru1 -o fd.csv
2015-03-19 18:52:40+0530 [scrapy] INFO: Scrapy 0.24.5 started (bot: basketball)
2015-03-19 18:52:40+0530 [scrapy] INFO: Optional features available: ssl, http11
2015-03-19 18:52:40+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basketball.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['basketball.spiders'], 'FEED_URI': 'fd.csv', 'BOT_NAME': 'basketball'}
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-19 18:52:40+0530 [scrapy] INFO: Enabled item pipelines:
2015-03-19 18:52:40+0530 [rotoguru1] INFO: Spider opened
2015-03-19 18:52:40+0530 [rotoguru1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-19 18:52:40+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-19 18:52:40+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-19 18:52:41+0530 [rotoguru1] DEBUG: Crawled (200) <GET http://rotoguru1.com/cgi-bin/hyday.pl?game=fd> (referer: None)
2015-03-19 18:52:41+0530 [rotoguru1] INFO: Closing spider (finished)
2015-03-19 18:52:41+0530 [rotoguru1] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 17052,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 19, 13, 22, 41, 513000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 19, 13, 22, 40, 296000)}
2015-03-19 18:52:41+0530 [rotoguru1] INFO: Spider closed (finished)
Thanks in advance for helping.
Upvotes: 0
Views: 1772
Reputation: 23806
There are two issues:
1) The URL in start_urls doesn't include game=fd, but the LinkExtractor expects game=fd in the URLs it follows.
2) The XPath for the table uses tbody, but that element is not present in the HTML source parsed by Scrapy (the expression works in the browser only because the browser adds tbody automatically).
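The tbody point can be reproduced without Scrapy. The sketch below is only an illustration under assumptions: the HTML snippet is a made-up stand-in for the raw page source, and it uses the standard library's ElementTree (which supports a subset of XPath) rather than Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw page source.
# Note that the table has no <tbody> element.
RAW_HTML = """
<html><body>
<table cellspacing="5">
  <tr><td>header</td><td>Guards</td></tr>
  <tr><td>Pos</td><td>Name</td></tr>
  <tr><td>PG</td><td><a href="#">Player A</a></td></tr>
  <tr><td>SG</td><td><a href="#">Player B</a></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path that assumes a browser-inserted <tbody>: matches nothing here.
with_tbody = root.findall(".//table[@cellspacing='5']/tbody//tr")

# Path without <tbody>: matches every row of the table.
without_tbody = root.findall(".//table[@cellspacing='5']//tr")

# ElementTree has no position()>2 predicate, so the two header rows
# are skipped by slicing instead.
data_rows = without_tbody[2:]
names = [row.findtext("td[2]/a") for row in data_rows]

print(len(with_tbody), len(without_tbody))
print(names)
```

With Scrapy's selectors the same effect comes from dropping /tbody from the expression and keeping the position()>2 predicate.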
Add game=fd to the URL in start_urls and change tablepath to //table[@cellspacing="5"]//tr[position()>2], and it should work.
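To see why the game=fd parameter matters for the rule, the allow pattern can be tried against a few links by hand. This is a rough sketch: the hrefs are hypothetical examples (the real links on the site may look different), and using re.search is an assumption about how the allow pattern is applied to each extracted URL.

```python
import re

# Pattern copied from the spider's LinkExtractor rule.
ALLOW = re.compile(r"mon=\d+&day=\d+&game=fd")

# Hypothetical hrefs for illustration only.
candidate_links = [
    "http://rotoguru1.com/cgi-bin/hyday.pl?mon=3&day=18&game=fd",
    "http://rotoguru1.com/cgi-bin/hyday.pl?mon=3&day=18&game=dk",
    "http://rotoguru1.com/cgi-bin/hyday.pl?mon=3&day=18",
]

# Only links that contain the full mon/day/game=fd sequence would be followed.
followed = [url for url in candidate_links if ALLOW.search(url)]
print(followed)
```

If no link on the start page matches, the rule never fires and parse_item is never called, which is consistent with the log showing a single crawled page and zero items.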
Upvotes: 1