bob_mcsh

Reputation: 17

Scrapy works in shell but not when I call my spider

I have been working on this for the past few hours, but I cannot figure out what I'm doing wrong. When I run my XPath statements using the selector in the Scrapy shell, they work as expected. When I use the same statement in my spider, however, I get back an empty set. Does anyone know what I am doing wrong?

from scrapy.spider import Spider
from scrapy.selector import Selector
from TFFRS.items import Result

class AthleteSpider(Spider):
    name = "athspider"
    allowed_domains = ["www.tffrs.org"]
    start_urls = ["http://www.tffrs.org/athletes/3237431/",]

    def parse(self, response):
        sel = Selector(response)
        results = sel.xpath("//table[@id='results_data']/tr")
        items = []
        for r in results:
            item = Result()
            item['event'] = r.xpath("td[@class='event']").extract()
            items.append(item)
        return items

Upvotes: 1

Views: 1052

Answers (3)

Pawel Miech

Reputation: 7822

When viewed by the spider, your URL contains no content. To debug this kind of problem you should use scrapy.shell.inspect_response in your parse method; use it like so:

from scrapy.shell import inspect_response

class AthleteSpider(Spider):

    # all your code    
    def parse(self, response):
        inspect_response(response, self)

Then when you run

scrapy crawl <your spider>

you will get a shell from within your spider. There you should run:

In [1]: view(response)

This will display this particular response as it looks to this particular spider.

Upvotes: 4

Iman Mirzadeh

Reputation: 13550

Scrapy spiders must implement specific methods; examples are parse and start_requests, but there are others in the docs.
If you don't implement these methods correctly, you will have problems. In my case the problem was a typo: my function name was start_request instead of start_requests!
So make sure your skeleton looks something like this:

class MySpider(scrapy.Spider):
    name = "name"
    allowed_domains = ["example.com"]  # domain only, no scheme here
    start_urls = ['https://example.com/']

    def start_requests(self):
        # start_requests method
        pass

    def parse(self, response):
        # parse method
        pass
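The failure mode here is silent because Scrapy falls back to the base class's default start_requests when your override is misnamed. A plain-Python sketch of the same mechanism (no Scrapy needed; class names are invented for illustration):

```python
# Minimal sketch of why a misspelled override fails silently: the
# framework always calls start_requests(), so a method named
# start_request (missing the trailing 's') is simply never invoked.
class BaseSpider:
    def start_requests(self):
        return ["request built from start_urls"]  # framework default

class TypoSpider(BaseSpider):
    def start_request(self):  # typo: never called by the framework
        return ["custom request"]

class FixedSpider(BaseSpider):
    def start_requests(self):  # correct name: overrides the default
        return ["custom request"]

print(TypoSpider().start_requests())   # ['request built from start_urls']
print(FixedSpider().start_requests())  # ['custom request']
```

No error is raised in the typo case, which is why this bug can eat hours of debugging time.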

Upvotes: 0

Abhishek

Reputation: 3068

Try using HtmlXPathSelector for extracting XPaths. Note that the scheme (http://) belongs in start_urls but not in allowed_domains. Also double-check the table id you are entering in your XPath; use your browser's inspect-element tool to get a proper XPath for the data you want to scrape.
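One caveat with XPaths taken from the browser inspector (this may or may not be the asker's issue, but it is a frequent cause of shell-vs-spider mismatches): browsers insert a tbody element inside table when rendering, so a direct-child /tr step can fail against markup that differs from what the inspector shows. A standard-library sketch of the difference between direct-child and descendant matching:

```python
import xml.etree.ElementTree as ET

# HTML as a browser's inspector shows it: the browser has inserted a
# <tbody> element that may not be present in the raw served markup.
browser_view = """<html><body>
<table id="results_data"><tbody>
<tr><td class="event">5000m</td></tr>
</tbody></table>
</body></html>"""

root = ET.fromstring(browser_view)
table = root.find(".//table[@id='results_data']")

# Direct children only: misses rows wrapped in <tbody>
direct_rows = table.findall("tr")
# Any descendant: matches with or without an intervening <tbody>
all_rows = table.findall(".//tr")

print(len(direct_rows), len(all_rows))  # 0 1
```

In Scrapy the equivalent defensive expression would be //table[@id='results_data']//tr, which matches rows whether or not a tbody is present.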

Also consider changing the function name; from the docs:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Upvotes: 0
