Reputation: 17
I have been working on this for the past few hours, but cannot figure out what I'm doing wrong. When I run my XPath statement using the selector in the scrapy shell, it works as expected. When I use the same statement in my spider, however, I get back an empty set. Does anyone know what I am doing wrong?
from scrapy.spider import Spider
from scrapy.selector import Selector
from TFFRS.items import Result

class AthleteSpider(Spider):
    name = "athspider"
    allowed_domains = ["www.tffrs.org"]
    start_urls = ["http://www.tffrs.org/athletes/3237431/",]

    def parse(self, response):
        sel = Selector(response)
        results = sel.xpath("//table[@id='results_data']/tr")
        items = []
        for r in results:
            item = Result()
            item['event'] = r.xpath("td[@class='event']").extract()
            items.append(item)
        return items
Upvotes: 1
Views: 1052
Reputation: 7822
When the spider fetches your URL, the response contains no content. To debug this kind of problem, call scrapy.shell.inspect_response from inside your parse method, like so:
from scrapy.shell import inspect_response

class AthleteSpider(Spider):
    # all your code
    def parse(self, response):
        inspect_response(response, self)
Then, when you run
scrapy crawl <your spider>
you will get a shell from within your spider. There you should run:
In [1]: view(response)
This opens the response in your browser exactly as this particular spider received it.
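As a complement to inspecting the response by eye, the same check can be done programmatically. The following is only a minimal sketch, assuming the page body is available as a string; the names TableIdFinder and has_table and the two sample bodies are hypothetical, and it uses the standard library's html.parser rather than Scrapy's own selectors.

```python
from html.parser import HTMLParser


class TableIdFinder(HTMLParser):
    """Collects the id attribute of every <table> element in a document."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs
            self.table_ids.append(dict(attrs).get("id"))


def has_table(body, table_id):
    """Return True if the HTML text contains a <table> with the given id."""
    finder = TableIdFinder()
    finder.feed(body)
    return table_id in finder.table_ids


# Hypothetical bodies: what a browser might show vs. what a spider might receive.
full_page = (
    '<html><body><table id="results_data">'
    '<tr><td class="event">100m</td></tr></table></body></html>'
)
empty_page = "<html><body></body></html>"

print(has_table(full_page, "results_data"))   # True
print(has_table(empty_page, "results_data"))  # False
```

Inside parse, something like has_table(response.text, "results_data") (in recent Scrapy versions) could be logged to confirm whether the table ever reached the spider before suspecting the XPath itself.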
Upvotes: 4
Reputation: 13550
Scrapy looks for specific method names on a spider, for example parse and start_requests (there are others in the docs).
If you misname one of these methods, Scrapy will not call your version and you will have problems. In my case the problem was a typo: my method was named start_request instead of start_requests!
So make sure your skeleton looks something like this:
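import scrapy

class MySpider(scrapy.Spider):
    name = "name"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def start_requests(self):
        # start_requests method (note the trailing "s")
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # parse method
        pass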
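A typo like start_request is easy to miss because nothing raises an error. One possible way to catch it is to compare the method names a class defines against the names Scrapy is expected to call. This is just a sketch under stated assumptions: EXPECTED_HOOKS, find_suspect_names and BrokenSpider are hypothetical names, the hook list is illustrative rather than exhaustive, and the stand-in class does not subclass scrapy.Spider so the snippet runs without Scrapy installed.

```python
import difflib

# Method names Scrapy is expected to call on a spider (illustrative, not exhaustive).
EXPECTED_HOOKS = ("start_requests", "parse")


def find_suspect_names(cls, expected=EXPECTED_HOOKS):
    """Return (defined_name, probably_meant) pairs for near-miss method names."""
    suspects = []
    for name, value in vars(cls).items():
        # Skip dunder attributes, correctly named hooks, and non-callables.
        if name.startswith("__") or name in expected or not callable(value):
            continue
        close = difflib.get_close_matches(name, expected, n=1, cutoff=0.8)
        if close:
            suspects.append((name, close[0]))
    return suspects


class BrokenSpider:  # stand-in; a real spider would subclass scrapy.Spider
    name = "broken"

    def start_request(self):  # typo: missing the trailing "s"
        return []

    def parse(self, response):
        return []


print(find_suspect_names(BrokenSpider))
# [('start_request', 'start_requests')]
```

The 0.8 cutoff is an arbitrary choice; a lower value flags more names at the cost of false positives.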
Upvotes: 0
Reputation: 3068
Try using HtmlXPathSelector for extracting XPaths. Remove http from the start_urls section. Also, the table id in your XPath is not entered correctly. Try using your browser's Inspect Element to get a proper XPath for the data you want to scrape.
Also consider changing the function name. From the docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
Upvotes: 0