Scraping Data from a table with Scrapy

I'm trying out Scrapy for the first time. After a fair bit of research I have the basics down. Now I'm trying to scrape the data from a table, but it isn't working. The source code is below.

items.py

from scrapy.item import Item, Field

class Digi(Item):

    sl = Field()
    player_name = Field()
    dismissal_info = Field()
    bowler_name = Field()
    runs_scored = Field()
    balls_faced = Field()
    minutes_played = Field()
    fours = Field()
    sixes = Field()
    strike_rate = Field()

digicric.py

from scrapy.spider import Spider
from scrapy.selector import Selector
from crawler01.items import Digi

class DmozSpider(Spider):
    name = "digicric"
    allowed_domains = ["digicricket.marssil.com"]
    start_urls = ["http://digicricket.marssil.com/match/MatchData.aspx?op=2&match=1250"]

    def parse(self, response):

        sel = Selector(response)
        sites = sel.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')
        items = []

        for site in sites:
            item = Digi()
            item['sl'] = sel.xpath('td/text()').extract()
            item['player_name'] = sel.xpath('td/a/text()').extract()
            item['dismissal_info'] = sel.xpath('td/text()').extract()
            item['bowler_name'] = sel.xpath('td/text()').extract()
            item['runs_scored'] = sel.xpath('td/text()').extract()
            item['balls_faced'] = sel.xpath('td/text()').extract()
            item['minutes_played'] = sel.xpath('td/text()').extract()
            item['fours'] = sel.xpath('td/text()').extract()
            item['sixes'] = sel.xpath('td/text()').extract()
            item['strike_rate'] = sel.xpath('td/text()').extract()
            items.append(item)
        return items

Upvotes: 1

Views: 3697

Answers (2)

Thomas Hsieh

Reputation: 731

I just ran your code with Scrapy and it worked perfectly. What exactly was not working for you?

P.S. This should be a comment but I don't have enough reputation yet... I will edit/close the answer accordingly if necessary.

EDIT:

I think you should yield item at the end of each loop iteration instead of appending to a list and returning it. The rest of your code should be fine.

Here is an example from the Scrapy documentation:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
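
For what it's worth, a callback that returns a list and one that yields should hand the same items to whatever iterates over the result; yield just avoids building the intermediate list. A plain-Python sketch of that equivalence follows. It does not use Scrapy at all, and the function names and sample rows are made up for illustration:

```python
def parse_with_return(rows):
    # Collect every item first, then hand back the whole list.
    items = []
    for row in rows:
        items.append({"player_name": row})
    return items


def parse_with_yield(rows):
    # Produce one item at a time; nothing is stored in between.
    for row in rows:
        yield {"player_name": row}


# Hypothetical sample data standing in for table rows.
rows = ["Alice", "Bob"]

# Iterating over either result gives the same sequence of items.
returned = list(parse_with_return(rows))
yielded = list(parse_with_yield(rows))
```

If the two forms really behave the same for the consumer, a spider that produces wrong data with return will most likely still produce wrong data with yield, so the selectors are worth checking as well.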

Upvotes: 1

alecxe

Reputation: 473803

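The key problem is that you are using sel inside the loop instead of site, so each expression is evaluated against the whole response rather than the current row. The other problem is that your XPath expressions point to a generic td element, while you need to select the td elements by index and match each one to its item field.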

Working solution:

def parse(self, response):
    sites = response.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')[1:-2]

    for site in sites:
        item = Digi()
        item['sl'] = site.xpath('td[1]/text()').extract()
        item['player_name'] = site.xpath('td[2]/a/text()').extract()
        item['dismissal_info'] = site.xpath('td[3]/text()').extract()
        item['bowler_name'] = site.xpath('td[4]/text()').extract()
        item['runs_scored'] = site.xpath('td[5]/b/text()').extract()
        item['balls_faced'] = site.xpath('td[6]/text()').extract()
        item['minutes_played'] = site.xpath('td[7]/text()').extract()
        item['fours'] = site.xpath('td[8]/text()').extract()
        item['sixes'] = site.xpath('td[9]/text()').extract()
        item['strike_rate'] = site.xpath('td[10]/text()').extract()
        yield item

It correctly outputs 11 item instances.
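
The two fixes (querying relative to each row, and picking cells by position) can be tried without Scrapy or network access. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, which supports only a subset of XPath. The markup is a hypothetical, simplified table, not the real page, and parse_rows is a name invented for this example:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the scorecard markup.
HTML = """
<div id="data">
  <table>
    <tr><th>SL</th><th>Player</th><th>Runs</th></tr>
    <tr><td>1</td><td><a href="#">Alice</a></td><td><b>42</b></td></tr>
    <tr><td>2</td><td><a href="#">Bob</a></td><td><b>7</b></td></tr>
    <tr><td colspan="3">Extras</td></tr>
    <tr><td colspan="3">Total</td></tr>
  </table>
</div>
"""

root = ET.fromstring(HTML)

# Drop the header row and the two trailing summary rows, as [1:-2] does above.
rows = root.findall("./table/tr")[1:-2]

# Querying from the top-level element finds nothing: td is not its direct child.
from_root = root.findall("td")


def parse_rows(rows):
    # Query relative to each row and select cells by position.
    for row in rows:
        yield {
            "sl": row.findtext("td[1]"),
            "player_name": row.findtext("td[2]/a"),
            "runs_scored": row.findtext("td[3]/b"),
        }


items = list(parse_rows(rows))
```

Here from_root comes back empty, much like the sel.xpath('td/...') calls in the question, while items holds one dictionary per data row.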

Upvotes: 0
