Reputation: 35
EDIT: THIS HAS BEEN RESOLVED! XPATH WAS THE ISSUE.
I'm very confused. I'm trying to write a very simple spider to crawl a website (talkbass.com) and get a list of all the links in the classified bass section (http://www.talkbass.com/forum/f126/). I wrote the spider based on the tutorial (which I completed with relative ease), but this one just isn't working. I'm probably doing a lot wrong, as I also tried to incorporate a Rule, but I'm getting nothing back.
My item code is:
from scrapy.item import Item, Field

class BassItem(Item):
    title = Field()
    link = Field()
    print title, link
My spider code is:
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem

class BassSpider(BaseSpider):
    name = "bass"
    allowed_domains = ["www.talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126/"]

    rules = (
        # Extract links matching 'f126/xxx'
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('/f126/(\d*)/', ), ))
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items
I don't get any errors, but the log just doesn't return anything. Here's what I see in the console:
C:\Python27\Scrapy\tutorial>scrapy crawl bass
2013-01-07 14:36:49+0800 [scrapy] INFO: Scrapy 0.16.3 started (bot: tutorial)
2013-01-07 14:36:49+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
{} {}
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-01-07 14:36:51+0800 [bass] INFO: Spider opened
2013-01-07 14:36:51+0800 [bass] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-01-07 14:36:51+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-01-07 14:36:52+0800 [bass] DEBUG: Crawled (200) <GET http://www.talkbass.com/forum/f126/> (referer: None)
2013-01-07 14:36:52+0800 [bass] INFO: Closing spider (finished)
2013-01-07 14:36:52+0800 [bass] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 233,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 17997,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 1, 7, 6, 36, 52, 305000),
     'log_count/DEBUG': 7,
     'log_count/INFO': 4,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2013, 1, 7, 6, 36, 51, 458000)}
2013-01-07 14:36:52+0800 [bass] INFO: Spider closed (finished)
I've never done an actual project, and I thought this would be a good place to start, but I can't seem to straighten this out.
I'm also not sure whether the XPath is correct. I'm using a Chrome extension called XPath Helper. An example of one of the sections I need is this:
/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbitsforum126']/tr[6]/td[@id='td_threadtitle944468']/div[1]/a[@id='thread_title944468']
However, note the "tr[6]" and the "944468": those are not constant for each link (everything else is). I just removed the class names and the numbers, which left me with what you see in my spider code.
Also, just to add: when I copy and paste the XPath directly from XPath Helper, it gives a syntax error:
ads = hxs.select('/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbits_forum_126']/tr[6]/td[@id='td_threadtitle_944468']/div[1]/a[@id='thread_title_944468']')
^
SyntaxError: invalid syntax
I have tried messing around with that (using wildcards where the elements are not constant) and have received syntax errors each time.
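Side note for anyone hitting the same SyntaxError: it comes from Python's string quoting rather than from the XPath. The expression is wrapped in single quotes while the predicates inside it also use single quotes, so the string literal ends early. One fix, using the same XPath unchanged, is to delimit the outer string with double quotes:
# Outer double quotes keep the single-quoted predicates inside
# the XPath from terminating the string early.
ads = hxs.select("/html/body/div[1]/div[@class='page']/div/table[5]/tbody/tr/td[2]/form[@id='inlinemodform']/table[@id='threadslist']/tbody[@id='threadbits_forum_126']/tr[6]/td[@id='td_threadtitle_944468']/div[1]/a[@id='thread_title_944468']")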
Upvotes: 2
Views: 1838
Reputation: 7889
One reason that could be causing the issue is that you are using a BaseSpider, which does not implement rules.
Try changing BaseSpider to CrawlSpider. You should also rename parse to something like parse_item (since CrawlSpider implements a parse function itself), which will necessitate explicitly setting a callback in your rule. E.g.:
rules = (
    # Extract links to the thread listing pages and hand each
    # crawled page to parse_item.
    Rule(SgmlLinkExtractor(allow=('index\d+\.html', )), callback='parse_item'),
)
An updated XPath to try is as follows. Note that this will include all of the sticky threads, so it's left as an exercise for the OP to work out how to filter those out.
ads = hxs.select("//td[substring(@id, 1, 15) = 'td_threadtitle_']/div/a")
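Putting this answer's pieces together, a minimal sketch of the full spider might look like this (untested; it reuses the Scrapy 0.16-era imports from the question, and adds follow=True since setting a callback switches a Rule's follow default to False):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem

class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["www.talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126/"]

    rules = (
        # Follow the listing pages and parse each one with parse_item.
        Rule(SgmlLinkExtractor(allow=('index\d+\.html', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # Match every thread-title cell regardless of its numeric suffix.
        ads = hxs.select("//td[substring(@id, 1, 15) = 'td_threadtitle_']/div/a")
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('text()').extract()
            item['link'] = ad.select('@href').extract()
            items.append(item)
        return items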
Upvotes: 1
Reputation: 35
As noted, the XPath was the big problem here. It has been edited to '//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div'
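For reference, a quick sketch of how that corrected XPath slots into the parse callback from the question (everything else unchanged):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # The corrected, relative XPath from the edit above.
    ads = hxs.select('//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div')
    items = []
    for ad in ads:
        item = BassItem()
        item['title'] = ad.select('a/text()').extract()
        item['link'] = ad.select('a/@href').extract()
        items.append(item)
    return items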
Upvotes: 0
Reputation: 4085
There are a couple of issues in your code.
BaseSpider doesn't support rules:
class BassSpider(BaseSpider):
so extend your spider from CrawlSpider rather than BaseSpider and it will start crawling links:
class BassSpider(CrawlSpider):
In the extraction part:
ads = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')
convert your XPath to a relative one rather than an absolute one.
In most cases the tbody tag is not actually present in the page source; the browser adds it while rendering, so an XPath containing tbody works in the browser but not in code. I recommend not using the tbody tag in your XPaths.
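To make the absolute-versus-relative point concrete, a rough before/after (the relative path is illustrative, built from the @id and @class values quoted elsewhere in this thread rather than re-verified against the live page):
# Absolute path copied from the browser. Browsers insert <tbody>
# while rendering, so this can match nothing in the raw source:
ads = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')

# Relative path anchored on a stable attribute, with no tbody step:
ads = hxs.select('//table[@id="threadslist"]//td[@class="alt1"]/div/a')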
Upvotes: 0
Reputation: 12410
Besides the fact that you're trying to use rules in a BaseSpider, which are not supported: are you finding any matches with that hxs.select statement? Try opening a command prompt and running scrapy shell http://www.talkbass.com/forum/f126/
Then type:
hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody')
then:
hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody/a/text()')
then:
hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody/a/@href')
If you do find a match, keep things simple and try:
item = BassItem()
item['title'] = hxs.select('/html/body/div/div/div/table/tbody/tr/td/form/table/tbody/a/text()').extract()
return item
Then in your item pipeline:
for i in item['title']:
    print i
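Sketched out as a full pipeline class, that might look like the following (the class name is hypothetical; process_item is the method Scrapy's pipeline interface calls for each item, and the pipeline still has to be listed in ITEM_PIPELINES in settings.py):
class PrintTitlesPipeline(object):
    # Hypothetical pipeline: Scrapy calls process_item once per item
    # the spider returns; print each title and pass the item on.
    def process_item(self, item, spider):
        for i in item['title']:
            print i
        return item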
The bottom line is that your hxs.select statement is not correct, so you should always open the shell and test your hxs.select statements until you know you have them right before running your spider.
Upvotes: 1