Reputation: 693
I'm trying to scrape the top 100 T20 batsmen from the ICC site, but the CSV file I get is blank. There are no errors in my code (at least none that I know of). Here is my items file:
import scrapy


class DmozItem(scrapy.Item):
    Ranking = scrapy.Field()
    Rating = scrapy.Field()
    Name = scrapy.Field()
    Nationality = scrapy.Field()
    Carer_Best_Rating = scrapy.Field()
And here is the dmoz_spider file:
import scrapy
from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "espn"
    allowed_domains = ["relianceiccrankings.com"]
    start_urls = ["http://www.relianceiccrankings.com/ranking/t20/batting/"]

    def parse(self, response):
        #sel = response.selector
        #for tr in sel.css("table.top100table>tbody>tr"):
        for tr in response.xpath('//table[@class="top100table"]/tr'):
            item = DmozItem()
            item['Ranking'] = tr.xpath('//td[@class="top100id"]/text()').extract_first()
            item['Rating'] = tr.xpath('//td[@class="top100rating"]/text()').extract_first()
            item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
            item['Nationality'] = tr.xpath('//td[@class="top100nation"]/text()').extract_first()
            item['Carer_Best_Rating'] = tr.xpath('//td[@class="top100cbr"]/text()').extract_first()
            yield item
What is wrong with my code?
Upvotes: 0
Views: 460
Reputation: 976
To answer your ranking problem: the XPath for Ranking starts with '//', which means 'search from the root of the document', so every row returns the first match on the whole page. You need the expression to be relative to tr instead. Simply remove the leading '//' from every XPath inside the for loop, for example:
item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
Upvotes: 0
Reputation: 5240
The page you were trying to scrape has a frame in it, and that frame is the part you actually want to scrape.
start_urls = [
"http://www.relianceiccrankings.com/ranking/t20/batting/"
]
This is the correct URL.
There are also several other things wrong:
To select elements, use the response itself. You don't need to create a variable from response.selector; just select straight from the response with response.xpath('//foo/bar').
Your CSS selector for the table is wrong. top100table is a class rather than an id, so it should be .top100table and not #top100table.
Here is the XPath for it:
response.xpath("//table[@class='top100table']/tr")
tbody isn't part of the HTML source; it only appears when you inspect the page with a modern browser.
The extract() method always returns a list rather than the element itself, so you need to take the first element you find, like this:
item['Ranking'] = tr.xpath('td[@class="top100id"]/a/text()').extract_first()
Hope this helps, have fun scraping!
Upvotes: 2