Reputation: 35
EDIT: THIS HAS BEEN RESOLVED! XPATH WAS THE ISSUE.
I'm very confused. I'm trying to write a very simple spider to crawl a website (talkbass.com) and get a list of all the links in the classified bass section (http://www.talkbass.com/forum/f126/). I wrote the spider based on the tutorial (which I completed with relative ease), but this one just isn't working. I'm probably doing a lot wrong, as I also tried to incorporate a Rule, but I'm getting nothing back.
My item code is:
from scrapy.item import Item, Field

class BassItem(Item):
    title = Field()
    link = Field()
    print title, link
My spider code is:
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem

class BassSpider(BaseSpider):
    name = "bass"
    allowed_domains = ["www.talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126/"]

    rules = (
        # Extract links matching 'f126/xxx'
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('/f126/(\d*)/', ), ))
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items
I don't get any errors, but the log just doesn't return anything. Here's what I see in the console:
C:\Python27\Scrapy\tutorial>scrapy crawl bass
2013-01-07 14:36:49+0800 [scrapy] INFO: Scrapy 0.16.3 started (bot: tutorial)
2013-01-07 14:36:49+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
{} {}
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-01-07 14:36:51+0800 [bass] INFO: Spider opened
2013-01-07 14:36:51+0800 [bass] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-01-07 14:36:52+0800 [bass] DEBUG: Crawled (200) <GET http://www.talkbass.com/forum/f126/> (referer: None)
2013-01-07 14:36:52+0800 [bass] INFO: Closing spider (finished)
2013-01-07 14:36:52+0800 [bass] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 233,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 17997,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 1, 7, 6, 36, 52, 305000),
     'log_count/DEBUG': 7,
     'log_count/INFO': 4,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2013, 1, 7, 6, 36, 51, 458000)}
2013-01-07 14:36:52+0800 [bass] INFO: Spider closed (finished)
I've never done an actual project, and I thought this would be a good place to start, but I can't seem to straighten this out.
I'm also not sure whether the XPath is correct. I'm using a Chrome extension called XPath Helper. An example of one of the sections I need is this:
/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbitsforum126']/tr[6]/td[@id='td_threadtitle944468']/div[1]/a[@id='thread_title944468']
However, note the "tr[6]" and the "944468": those are not constant for each link (everything else is). I just removed the class names and the numbers, which left me with what you see in my spider code.
Also, just to add: when I copy and paste the XPath directly from XPath Helper, it gives a syntax error:
ads = hxs.select('/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbits_forum_126']/tr[6]/td[@id='td_threadtitle_944468']/div[1]/a[@id='thread_title_944468']')
^
SyntaxError: invalid syntax
I have tried messing around with that (using wildcards where the elements are not constant) and have received syntax errors each time.
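Side note for anyone hitting the same SyntaxError: it comes from Python's string quoting rather than from the XPath. The expression is wrapped in single quotes while the predicates inside it also use single quotes, so the string literal ends early. One fix, using the same XPath unchanged, is to delimit the outer string with double quotes:
# Outer double quotes keep the single-quoted predicates inside
# the XPath from terminating the string early.
ads = hxs.select("/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbits_forum_126']/tr[6]/td[@id='td_threadtitle_944468']/div[1]/a[@id='thread_title_944468']")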
Upvotes: 2
Views: 1838
Reputation: 7889
One reason that could be causing the issue is that you are using a BaseSpider, which does not implement rules.
Try changing BaseSpider to CrawlSpider. You should also rename parse to something like parse_item (since CrawlSpider implements a parse function itself), which will necessitate explicitly setting a callback in your rule. E.g.:
rules = (
    # Extract links to the thread listing pages and hand each
    # crawled page to parse_item.
    Rule(SgmlLinkExtractor(allow=('index\d+\.html', )), callback='parse_item'),
)
An updated XPath to try is as follows. Note that this will include all of the sticky threads, so it's left as an exercise for the OP to work out how to filter those out.
ads = hxs.select("//td[substring(@id, 1, 15) = 'td_threadtitle_']/div/a")
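Putting this answer's pieces together, a minimal sketch of the full spider might look like this (untested; it reuses the Scrapy 0.16-era imports from the question, and adds follow=True since setting a callback switches a Rule's follow default to False):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem

class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["www.talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126/"]

    rules = (
        # Follow the listing pages and parse each one with parse_item.
        Rule(SgmlLinkExtractor(allow=('index\d+\.html', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # Match every thread-title cell regardless of its numeric suffix.
        ads = hxs.select("//td[substring(@id, 1, 15) = 'td_threadtitle_']/div/a")
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('text()').extract()
            item['link'] = ad.select('@href').extract()
            items.append(item)
        return items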
Upvotes: 1
Reputation: 35
As noted, the XPath was the big problem here. It has been edited to '//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div'
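For reference, a quick sketch of how that corrected XPath slots into the parse callback from the question (everything else unchanged):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # The corrected, relative XPath from the edit above.
    ads = hxs.select('//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div')
    items = []
    for ad in ads:
        item = BassItem()
        item['title'] = ad.select('a/text()').extract()
        item['link'] = ad.select('a/@href').extract()
        items.append(item)
    return items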
Upvotes: 0
Reputation: 4085
There are a couple of issues in your code.
BaseSpider doesn't support rules:
class BassSpider(BaseSpider):
so extend your spider from CrawlSpider rather than BaseSpider and it will start crawling links:
class BassSpider(CrawlSpider):
In the extraction part:
ads = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')
convert your XPath to a relative one rather than an absolute one.
In most cases the tbody tag is not actually present in the page source; the browser adds it while rendering, so an XPath containing tbody works in the browser but not in code. I recommend not using the tbody tag in your XPaths.
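To make the absolute-versus-relative point concrete, a rough before/after (the relative path is illustrative, built from the @id and @class values quoted elsewhere in this thread rather than re-verified against the live page):
# Absolute path copied from the browser. Browsers insert <tbody>
# while rendering, so this can match nothing in the raw source:
ads = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')

# Relative path anchored on a stable attribute, with no tbody step:
ads = hxs.select('//table[@id="threadslist"]//td[@class="alt1"]/div/a')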
Upvotes: 0
Reputation: 12410
Besides the fact that you're trying to use rules in a BaseSpider, which are not supported: are you finding any matches with that hxs.select statement? Try opening a command prompt and running scrapy shell http://www.talkbass.com/forum/f126/
Then type:
hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')
then:
hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody/a/text()')
then:
hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody/a/@href')
If you do find a match, keep things simple and try:
item = BassItem()
item['title'] = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody/a/text()').extract()
return item
Then in your item pipeline:
for i in item['title']:
    print i
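Sketched out as a full pipeline class, that might look like the following (the class name is hypothetical; process_item is the method Scrapy's pipeline interface calls for each item, and the pipeline still has to be listed in ITEM_PIPELINES in settings.py):
class PrintTitlesPipeline(object):
    # Hypothetical pipeline: Scrapy calls process_item once per item
    # the spider returns; print each title and pass the item on.
    def process_item(self, item, spider):
        for i in item['title']:
            print i
        return item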
The bottom line is that your hxs.select statement is not correct, so you should always open the shell and test your hxs.select statements until you know you have them right before running your spider.
Upvotes: 1