Reputation: 13
I'm trying out Scrapy for the first time. After a fair bit of research I have the basics down. Now I'm trying to extract the data from a table, but it isn't working. The source code is below.
items.py
from scrapy.item import Item, Field


class Digi(Item):
    sl = Field()
    player_name = Field()
    dismissal_info = Field()
    bowler_name = Field()
    runs_scored = Field()
    balls_faced = Field()
    minutes_played = Field()
    fours = Field()
    sixes = Field()
    strike_rate = Field()
digicric.py
from scrapy.spider import Spider
from scrapy.selector import Selector
from crawler01.items import Digi


class DmozSpider(Spider):
    name = "digicric"
    allowed_domains = ["digicricket.marssil.com"]
    start_urls = ["http://digicricket.marssil.com/match/MatchData.aspx?op=2&match=1250"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')
        items = []
        for site in sites:
            item = Digi()
            item['sl'] = sel.xpath('td/text()').extract()
            item['player_name'] = sel.xpath('td/a/text()').extract()
            item['dismissal_info'] = sel.xpath('td/text()').extract()
            item['bowler_name'] = sel.xpath('td/text()').extract()
            item['runs_scored'] = sel.xpath('td/text()').extract()
            item['balls_faced'] = sel.xpath('td/text()').extract()
            item['minutes_played'] = sel.xpath('td/text()').extract()
            item['fours'] = sel.xpath('td/text()').extract()
            item['sixes'] = sel.xpath('td/text()').extract()
            item['strike_rate'] = sel.xpath('td/text()').extract()
            items.append(item)
        return items
Upvotes: 1
Views: 3697
Reputation: 731
I just ran your code with Scrapy and it worked perfectly. What exactly was not working for you?
P.S. This should be a comment but I don't have enough reputation yet... I will edit/close the answer accordingly if necessary.
EDIT:
I think you should use yield item at the end of each loop iteration instead of return. The rest of your code should be fine.
Here is an example from the Scrapy documentation:
import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
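To make the yield-versus-return point concrete, here is a minimal plain-Python sketch (no Scrapy involved). The function names and the rows list are hypothetical stand-ins for a spider callback and its table rows. As far as I understand, Scrapy only needs the callback to produce an iterable of items, so both styles should give the same sequence. The difference is that the generator produces each item as soon as it is built.

```python
def parse_returning_list(rows):
    """Build every item first, then return the whole list at once."""
    items = []
    for row in rows:
        items.append({"sl": row})
    return items


def parse_yielding(rows):
    """Hand each item back as soon as it is built (a generator)."""
    for row in rows:
        yield {"sl": row}


# Hypothetical stand-in for the table rows a spider would iterate over.
rows = ["1", "2", "3"]

from_list = list(parse_returning_list(rows))
from_generator = list(parse_yielding(rows))
```

Iterating either result gives the same three dictionaries, so the choice is mostly about style and memory use rather than correctness.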
Upvotes: 1
Reputation: 473803
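The key problem is that you are using sel inside the loop instead of site, so every relative XPath is evaluated against the document root rather than the current row. The other key problem is that your XPath expressions point to every td element in a row, while you need to get the td elements by index and correlate each one with the matching item field.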
Working solution:
def parse(self, response):
    sites = response.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')[1:-2]
    for site in sites:
        item = Digi()
        item['sl'] = site.xpath('td[1]/text()').extract()
        item['player_name'] = site.xpath('td[2]/a/text()').extract()
        item['dismissal_info'] = site.xpath('td[3]/text()').extract()
        item['bowler_name'] = site.xpath('td[4]/text()').extract()
        item['runs_scored'] = site.xpath('td[5]/b/text()').extract()
        item['balls_faced'] = site.xpath('td[6]/text()').extract()
        item['minutes_played'] = site.xpath('td[7]/text()').extract()
        item['fours'] = site.xpath('td[8]/text()').extract()
        item['sixes'] = site.xpath('td[9]/text()').extract()
        item['strike_rate'] = site.xpath('td[10]/text()').extract()
        yield item
It correctly outputs 11 item instances.
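The two points can also be reproduced without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree so it runs with no dependencies. The markup is a made-up miniature scorecard, not the real page, and parse_rows is a hypothetical helper. It shows that a relative td path evaluated from the root matches nothing, while the same kind of path evaluated from each row, with a positional index, picks out one cell per field.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's structure is not reproduced.
MARKUP = """
<div id="data">
  <table>
    <tr><th>SL</th><th>Player</th><th>Runs</th></tr>
    <tr><td>1</td><td><a>Player A</a></td><td><b>42</b></td></tr>
    <tr><td>2</td><td><a>Player B</a></td><td><b>7</b></td></tr>
    <tr><td>Total</td><td></td><td>49</td></tr>
  </table>
</div>
"""


def parse_rows(markup):
    """Return one dict per data row, reading cells by position within the row."""
    root = ET.fromstring(markup)
    # Slice off the header row and the trailing total row.
    rows = root.findall("./table/tr")[1:-1]
    items = []
    for row in rows:
        items.append({
            # Paths are relative to the current row, and td[n] is 1-based.
            "sl": row.findtext("td[1]"),
            "player_name": row.findtext("td[2]/a"),
            "runs_scored": row.findtext("td[3]/b"),
        })
    return items


# The original mistake: a relative path evaluated from the root finds no
# cells, because td is not a direct child of the root element.
cells_from_root = ET.fromstring(MARKUP).findall("td")

items = parse_rows(MARKUP)
```

Here cells_from_root is an empty list, while items holds one dictionary per player row. In the spider itself, site.xpath('td[1]/text()') plays the role that row.findtext("td[1]") plays in this sketch.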
Upvotes: 0